[raspi-internals] Re: QPU Tutorials/Samples

  • From: Herman Hermitage <hermanhermitage@xxxxxxxxxxx>
  • To: "raspi-internals@xxxxxxxxxxxxx" <raspi-internals@xxxxxxxxxxxxx>
  • Date: Sun, 16 Feb 2014 11:16:05 +1200

Hi Shachar!
Thank you for your post.

> First of all, kudos on getting the QPU assembly to this level.
Thanks.

> I have been working on getting the shader_256.s example to compile. Got
> few fixes in the pipe, and now the assembler doesn't crash when
> compiling. Sent a pull request with what I have so far. I am still

Ok, I merged that.  I guess I should add /* ... */ support; alas, as you can
see, I didn't write a real lexer...

My plan was to assemble files with:
  /* addr: word0 word1 */ assembly #comment
and do an assertion check that word0 and word1 match the assembled code.
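
A minimal sketch of such a check, assuming the annotation always sits at the
start of the line in exactly the form shown above. The names parse_annotation,
check_line and assert_line_matches are hypothetical and not part of the
assembler:

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Expected encoding taken from a leading "addr: word0 word1" comment. */
struct annotation {
    uint32_t addr, word0, word1;
};

/* Parse a leading annotation comment; returns 1 on success, 0 if absent. */
static int parse_annotation(const char *line, struct annotation *out)
{
    unsigned int addr, w0, w1;

    if (sscanf(line, " /* %x: %x %x */", &addr, &w0, &w1) != 3)
        return 0;
    out->addr = addr;
    out->word0 = w0;
    out->word1 = w1;
    return 1;
}

/* Returns 1 if the assembled words match the annotation, or if the line
 * carries no annotation at all; returns 0 on a mismatch. */
static int check_line(const char *line, uint32_t word0, uint32_t word1)
{
    struct annotation a;

    if (!parse_annotation(line, &a))
        return 1;
    return a.word0 == word0 && a.word1 == word1;
}

/* The assertion form: abort as soon as one line assembles differently. */
static void assert_line_matches(const char *line, uint32_t word0, uint32_t word1)
{
    assert(check_line(line, word0, word1));
}
```

Calling assert_line_matches(line, word0, word1) after assembling each line
would then stop at the first instruction whose encoding differs from the
reference binary.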

> missing the rotator and packer support - any chance you can document
> this in the bit level?

Ok, I just pushed the latest qpudis.c

Rotator:

- if the third slot op is 13, then the 6 bits of register bank b reference 
decode into the immediate table:

const char *imm[] = {
        "0", "1", "2", "3", "4", "5", "6", "7",
        "8", "9", "10", "11", "12", "13", "14", "15",
        "-16", "-15", "-14", "-13", "-12", "-11", "-10", "-9",
        "-8", "-7", "-6", "-5", "-4", "-3", "-2", "-1",
        "1.0", "2.0", "4.0", "8.0", "16.0", "32.0", "64.0", "128.0",
        "1/256", "1/128", "1/64", "1/32", "1/16", "1/8", "1/4", "1/2",
        ">> r5", ">> 1", ">> 2", ">> 3", ">> 4", ">> 5", ">> 6", ">> 7",
        ">> 8", ">> 9", ">> 10", ">> 11", ">> 12", ">> 13", ">> 14", ">> 15"
};

- if the value is 48 or higher then any reference (at least in the multiply
operation) to a0...a4 or ra0...ra31 becomes rotated, e.g. a0 >> 13, or ra14 >> 1
- the >> r5 form allows a variable rotation distance.
- a rotated rb reference would be somewhat self-referential in terms of bits
and would be referring to rb48-rb63, so I suspect it's not allowed.
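
The table layout can also be expressed arithmetically. The following is a
sketch only, derived from the imm[] table above; the type and function names
(small_imm, decode_small_imm) are made up for illustration and do not appear
in qpudis.c:

```c
#include <assert.h>
#include <stdint.h>

enum imm_kind { IMM_INT, IMM_FLOAT, IMM_ROT_R5, IMM_ROT_CONST };

struct small_imm {
    enum imm_kind kind;
    int ivalue;    /* IMM_INT: the integer; IMM_ROT_CONST: rotation distance */
    float fvalue;  /* IMM_FLOAT: the float value */
};

/* Classify the 6-bit rb field when it is used as an immediate (op == 13). */
static struct small_imm decode_small_imm(uint32_t rb)
{
    struct small_imm r = { IMM_INT, 0, 0.0f };

    assert(rb < 64);
    if (rb < 16) {
        r.ivalue = (int)rb;                 /* 0 .. 15 */
    } else if (rb < 32) {
        r.ivalue = (int)rb - 32;            /* -16 .. -1 */
    } else if (rb < 48) {
        /* 32..39 -> 2^0 .. 2^7, 40..47 -> 2^-8 .. 2^-1 */
        int exp = (rb < 40) ? (int)rb - 32 : (int)rb - 48;
        r.kind = IMM_FLOAT;
        r.fvalue = (exp >= 0) ? (float)(1 << exp)
                              : 1.0f / (float)(1 << -exp);
    } else if (rb == 48) {
        r.kind = IMM_ROT_R5;                /* rotation distance comes from r5 */
    } else {
        r.kind = IMM_ROT_CONST;             /* rotate by 1 .. 15 */
        r.ivalue = (int)rb - 48;
    }
    return r;
}
```

For example, rb = 61 decodes to a constant rotation by 13, which is the
a0 >> 13 case mentioned in the list.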
                                                                                
                                                                                
                                                                    
//   mulop:3 addop:5 ra:6 rb:6 adda:3 addb:3 mula:3 mulb:3, op:4 packbits:8 addcc:3 mulcc:3 F:1 X:1 wa:6 wb:6

const char *qpu_r(uint32_t ra, uint32_t rb, uint32_t adda, uint32_t op,
                  int rotator) {

    if (op == 13) {
        if (rb<48) {
            if (adda==6) return banka_r[ra];
            if (adda==7) return imm[rb];
        }
        else {
            if ((adda<6) && rotator) {
                char *tmp = tmpalloc(32);
                sprintf(tmp, "%s%s", acc_names[adda], imm[rb]);
                return tmp;
            }
            if ((adda==6) && rotator) {
                char *tmp = tmpalloc(32);
                sprintf(tmp, "%s%s", banka_r[ra], imm[rb]);
                return tmp;
            }
            if ((adda==7) && rotator) {
                return "err?";
            }
        }
    }

    if (adda==6) return banka_r[ra];
    if (adda==7) return bankb_r[rb];
    return acc_names[adda];
}

I should really revisit those names; it would probably be clearer if I called
things src1_mux, src2_mux etc. instead of adda.

Packer I have to revisit.  I thought I had it, but it appears not.


> ldi with 32 bit argument:
> Original     - ldi.never -, 0x00000019 #/* 00000180: 00000019 e80009e7 */
> Compiled  - ldi.never -, 0x00000019 #/* 00000180: 00000019 e00009e7 */

My current conjecture is:
  ldi.never -, (specialReg << 16) | (value)  // *specialReg = value

invokes some extended behaviour.  When specialReg = 0, it looks like this is
related to VPM access.  I've experimented but (seemingly) only managed to hang
the QPU.

For instance, I thought it might be setting up VPM index offsets for the write,
sort of like scatter/gather indices.

The FFT code waits for outstanding VPM write DMA to finish (mov.never -,
vw_wait) before doing the ldi.never -, 0x19, so that's a sign it's touching the
VPM.

> Note the 8 in the second word, which appears in the original binary but
> not in our decompile-recompile result. The assembly documents these 8
> bits as "unknown", any guess as for the meaning of this field?

For an ldi reg, imm32 I can see what two of the bits mean:

  // Load 2 bit unsigned vectors
  0x00000001, 0xe60208e7, // mov  r3, <1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0>
  0x00010000, 0xe60208e7, // mov  r3, <2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0>
  0x00010001, 0xe60208e7, // mov  r3, <3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0>
  0x80000000, 0xe60208e7, // mov  r3, <0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2>

  // Load 2 bit signed vectors
  0x00000000, 0xe20208e7, // mov  r3, <0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0>
  0x00008000, 0xe20208e7, // mov  r3, <0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1>
  0x80000000, 0xe20208e7, // mov  r3, <0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, -2>
  0x80008000, 0xe20208e7, // mov  r3, <0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, -1>

  // For 0xe01, 0xe02, 0xe04, 0xe08, 0xe10, 0xe80, I didn't witness any new behaviour

(I guess with the JavaScript expression evaluator I can add support for
16-element array expressions/constants quite easily.)
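
Reading the listing above, lane i seems to take its low bit from bit i of the
immediate and its high bit from bit i+16, with the 0xe2 form sign-extending the
2-bit value and the 0xe6 form leaving it unsigned. A sketch of both directions
under that assumption (ldi_unpack and ldi_pack are hypothetical helper names,
and the lane order is inferred only from the eight samples shown):

```c
#include <assert.h>
#include <stdint.h>

/* Expand a per-element ldi immediate into 16 lane values.
 * Lane i = bit i (low bit) | bit i+16 (high bit). */
static void ldi_unpack(uint32_t imm, int is_signed, int lanes[16])
{
    for (int i = 0; i < 16; i++) {
        int v = (int)((imm >> i) & 1u) | (int)(((imm >> (i + 16)) & 1u) << 1);
        if (is_signed && v >= 2)
            v -= 4;                     /* 2 -> -2, 3 -> -1 */
        lanes[i] = v;
    }
}

/* Inverse: build the 32-bit immediate from 16 lane values. */
static uint32_t ldi_pack(const int lanes[16], int is_signed)
{
    uint32_t imm = 0;

    for (int i = 0; i < 16; i++) {
        int v = lanes[i];
        /* Each lane must fit in 2 bits for the chosen signedness. */
        assert(is_signed ? (v >= -2 && v <= 1) : (v >= 0 && v <= 3));
        uint32_t bits = (uint32_t)v & 3u;
        imm |= (bits & 1u) << i;
        imm |= (bits >> 1) << (i + 16);
    }
    return imm;
}
```

An assembler front end could call ldi_pack on a parsed 16-element constant and
emit the result as the first instruction word.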

>
> bra with return address:
> The original file contains the line:
> brr rb4, after_write_qpu_1_7 #// 0x00000268 #/* 00000210: 00000038
> f0f81127 */
> However, the description of branches in
> https://github.com/hermanhermitage/videocoreiv-qpu is
>
>
>
>    addr:32, 1111 0000 cond:4 relative:1 register:1 ra:5 X:1 wa:6 wb:6
>
> which means wa=4, wb=39 ("-"), X=1. Should the description of brr be
> "writes to the register number specified by wa bits, selects the bank
> to write to by the X bit"?

void show_qpu_branch(uint32_t i0, uint32_t i1)
{
    uint32_t addr     = i0;
    uint32_t unknown  = (i1>> 24) & 0x0f;
    uint32_t cond     = (i1>> 20) & 0x0f;
    uint32_t pcrel    = (i1>> 19) & 0x01;
    uint32_t addreg   = (i1>> 18) & 0x01;
    uint32_t ra       = (i1>> 13) & 0x1f;
    uint32_t X        = (i1>> 12) & 0x01;
    uint32_t wa       = (i1>>  6) & 0x3f;
    uint32_t wb       = (i1>>  0) & 0x3f;

    if (showfields) {
        printf("branch addr=0x%08x, unknown=%x, cond=%02d, pcrel=%x, addreg=%x, "
               "ra=%02d, X=%x, wa=%02d, wb=%02x\n",
            addr, unknown, cond, pcrel, addreg, ra, X, wa, wb);
    }
    // branch: b[link][cc] [linkreg,] [basedreg,]
    printf("%s%s %s; %s, %s%+d",
        pcrel ? "brr" : "bra",
        bcc[cond],
        qpu_w_add(wa, X),
        qpu_w_mul(wb, X),
        addreg ? qpu_r(ra, ra, 6, (i1>> 28)&0xf, 0) : "",
        addr);
    if (!addreg) printf(" // 0x%08x", base+addr+8*4);
    printf("\n");

}

Yes, I think so.  Normally the add alu writes to bank a and the mul alu to bank
b.  The X (crossover) bit swaps the banks over.
So wa=4, wb=39, X=1 means rb4, -  whereas wa=4, wb=39, X=0 would mean ra4, -
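
As a worked check, the second word of the quoted brr line (0xf0f81127) can be
unpacked with the same shifts as show_qpu_branch. This is a sketch only:
decode_branch and add_dest_name are hypothetical names, and register numbers
of 32 and above are not modelled apart from 39, which is printed as "-" here
as in the quoted message:

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

struct branch_fields {
    uint32_t cond, pcrel, addreg, ra, X, wa, wb;
};

/* Split the second instruction word of a branch into its fields. */
static struct branch_fields decode_branch(uint32_t i1)
{
    struct branch_fields f;

    f.cond   = (i1 >> 20) & 0x0f;
    f.pcrel  = (i1 >> 19) & 0x01;
    f.addreg = (i1 >> 18) & 0x01;
    f.ra     = (i1 >> 13) & 0x1f;
    f.X      = (i1 >> 12) & 0x01;
    f.wa     = (i1 >>  6) & 0x3f;
    f.wb     = (i1 >>  0) & 0x3f;
    return f;
}

/* Name the wa write target: bank a normally, bank b when X is set.
 * Only plain registers (0..31) and 39 ("-") are handled meaningfully. */
static void add_dest_name(uint32_t wa, uint32_t X, char *buf, size_t len)
{
    assert(wa < 64);
    if (wa == 39)
        snprintf(buf, len, "-");
    else
        snprintf(buf, len, "r%c%u", X ? 'b' : 'a', (unsigned)wa);
}
```

With i1 = 0xf0f81127 this gives pcrel=1, addreg=0, X=1, wa=4, wb=39, and
add_dest_name(4, 1, ...) produces "rb4", matching the original source line.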

Now I've confused myself...  I would have thought the branch instructions
should be using the add alu for the target address calculation and the mul alu
to mov the pc to the link register.  That would mean wa is used to capture the
target address and wb to capture the return address.  But we are seeing wa used
to capture the return address.

/HH                                       
