Hi Shachar!

Thank you for your post.

> First of all, kudos on getting the QPU assembly to this level.

Thanks.

> I have been working on getting the shader_256.s example to compile. Got
> few fixes in the pipe, and now the assembler doesn't crash when
> compiling. Sent a pull request with what I have so far. I am still

OK, I merged that. I guess I should add /* ... */ support; alas, as you can see, I didn't write a real lexer...

My plan was to assemble files with:

    /* addr: word0 word1 */ assembly #comment

and do an assertion check that word0 and word1 match the assembled code.

> missing the rotator and packer support - any chance you can document
> this in the bit level?

OK, I just pushed the latest qpudis.c

Rotator:

- if the third slot op is 13, then the 6 bits of the register bank b reference decode into the immediate table:

    const char *imm[] = {
      "0", "1", "2", "3", "4", "5", "6", "7",
      "8", "9", "10", "11", "12", "13", "14", "15",
      "-16", "-15", "-14", "-13", "-12", "-11", "-10", "-9",
      "-8", "-7", "-6", "-5", "-4", "-3", "-2", "-1",
      "1.0", "2.0", "4.0", "8.0", "16.0", "32.0", "64.0", "128.0",
      "1/256", "1/128", "1/64", "1/32", "1/16", "1/8", "1/4", "1/2",
      ">> r5", ">> 1", ">> 2", ">> 3", ">> 4", ">> 5", ">> 6", ">> 7",
      ">> 8", ">> 9", ">> 10", ">> 11", ">> 12", ">> 13", ">> 14", ">> 15"
    };

- if the value is 48 or higher, then any reference (at least in the multiply operation) to a0...a4 or ra0...ra31 becomes rotated, e.g. a0>> 13, or ra14>> 1
- the >> r5 form allows a variable rotation distance.
- a rotated rb reference would be somewhat self-referential in terms of bits and would be referring to rb48-rb63, so I suspect it's not allowed.

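As a rough sketch of how that 6-bit field splits up when op is 13, going only by the table above: the type and function names here (imm_kind, imm_info, decode_imm) are made up for illustration and are not in qpudis.c.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical names; the four ranges follow the imm[] table above. */
enum imm_kind { IMM_INT, IMM_FLOAT, IMM_ROT_R5, IMM_ROT_N };

struct imm_info {
    enum imm_kind kind;
    int   ival;  /* integer immediate, or rotation distance for IMM_ROT_N */
    float fval;  /* value for IMM_FLOAT, otherwise 0 */
};

/* Classify the 6-bit rb field of an instruction whose op slot is 13. */
static struct imm_info decode_imm(uint32_t rb)
{
    struct imm_info r = { IMM_INT, 0, 0.0f };
    assert(rb < 64);                 /* the field is only 6 bits wide */

    if (rb < 16) {                   /* 0..15   ->  0..15  */
        r.ival = (int)rb;
    } else if (rb < 32) {            /* 16..31  -> -16..-1 */
        r.ival = (int)rb - 32;
    } else if (rb < 48) {            /* 32..39 -> 2^0..2^7, 40..47 -> 2^-8..2^-1 */
        int e = (rb < 40) ? (int)rb - 32 : (int)rb - 48;
        r.kind = IMM_FLOAT;
        r.fval = (e >= 0) ? (float)(1 << e) : 1.0f / (float)(1 << -e);
    } else if (rb == 48) {           /* ">> r5": distance comes from r5 */
        r.kind = IMM_ROT_R5;
    } else {                         /* 49..63 -> rotate by 1..15 */
        r.kind = IMM_ROT_N;
        r.ival = (int)rb - 48;
    }
    return r;
}
```

For example, decode_imm(61) comes back as IMM_ROT_N with a distance of 13, which matches the ">> 13" entry in the table.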
    // mulop:3 addop:5 ra:6 rb:6 adda:3 addb:3 mula:3 mulb:3, op:4 packbits:8 addcc:3 mulcc:3 F:1 X:1 wa:6 wb:6
    const char *qpu_r(uint32_t ra, uint32_t rb, uint32_t adda, uint32_t op, int rotator) {
      if (op == 13) {
        if (rb < 48) {
          // rb is a plain small immediate
          if (adda == 6) return banka_r[ra];
          if (adda == 7) return imm[rb];
        }
        else {
          // rb is a rotation: append the ">> n" suffix to the source name
          if ((adda < 6) && rotator) {
            char *tmp = tmpalloc(32);
            snprintf(tmp, 32, "%s%s", acc_names[adda], imm[rb]);
            return tmp;
          }
          if ((adda == 6) && rotator) {
            char *tmp = tmpalloc(32);
            snprintf(tmp, 32, "%s%s", banka_r[ra], imm[rb]);
            return tmp;
          }
          if ((adda == 7) && rotator) {
            return "err?";
          }
        }
      }
      if (adda == 6) return banka_r[ra];
      if (adda == 7) return bankb_r[rb];
      return acc_names[adda];
    }

I should really revisit those names; it would probably be clearer if I called things src1_mux, src2_mux etc. instead of adda.

Packer:

I have to revisit this. I thought I had it, but it appears not.

> ldi with 32 bit argument:
> Original - ldi.never -, 0x00000019 #/* 00000180: 00000019 e80009e7 */
> Compiled - ldi.never -, 0x00000019 #/* 00000180: 00000019 e00009e7 */

My current conjecture is that

    ldi.never -, (specialReg << 16) | (value)   // *specialReg = value

invokes some extended behaviour. When specialReg = 0, it looks like this is related to VPM access. I've experimented but (seemingly) only managed to hang the QPU. For instance, I thought it might be setting up VPM index offsets for the write, sort of like scatter/gather indices. The FFT code waits for outstanding VPM write DMA to finish (mov.never -, vw_wait) before doing the ldi.never -, 0x19, so that's a sign it's touching the VPM.

> Note the 8 in the second word, which appears in the original binary but
> not in our decompile-recompile result. The assembly documents these 8
> bits as "unknown", any guess as for the meaning of this field?

For a ldi reg, imm32 I can see what two of the bits mean:

    // Load 2 bit unsigned vectors
    0x00000001, 0xe60208e7, // mov r3, <1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0>
    0x00010000, 0xe60208e7, // mov r3, <2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0>
    0x00010001, 0xe60208e7, // mov r3, <3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0>
    0x80000000, 0xe60208e7, // mov r3, <0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2>

    // Load 2 bit signed vectors
    0x00000000, 0xe20208e7, // mov r3, <0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0>
    0x00008000, 0xe20208e7, // mov r3, <0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1>
    0x80000000, 0xe20208e7, // mov r3, <0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, -2>
    0x80008000, 0xe20208e7, // mov r3, <0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, -1>

    // For 0xe01, 0xe02, 0xe04, 0xe08, 0xe10, 0xe80, I didn't witness any new behaviour

(I guess with the javascript expression evaluator I can add support for 16 element array expr/constant quite easily).

>
> bra with return address:
> The original file contains the line:
> brr rb4, after_write_qpu_1_7 #// 0x00000268 #/* 00000210: 00000038
> f0f81127 */
> However, the description of branches in
> https://github.com/hermanhermitage/videocoreiv-qpu is
>
> > addr:32, 1111 0000 cond:4 relative:1 register:1 ra:5 X:1 wa:6 wb:6
>
> which means wa=4, wb=39 ("-"), X=1. Should the description of brr be
> "writes to the register number specified by wa bits, selects the bank
> to write to by the X bit"?

    void show_qpu_branch(uint32_t i0, uint32_t i1) {
      uint32_t addr    = i0;
      uint32_t unknown = (i1 >> 24) & 0x0f;
      uint32_t cond    = (i1 >> 20) & 0x0f;
      uint32_t pcrel   = (i1 >> 19) & 0x01;
      uint32_t addreg  = (i1 >> 18) & 0x01;
      uint32_t ra      = (i1 >> 13) & 0x1f;
      uint32_t X       = (i1 >> 12) & 0x01;
      uint32_t wa      = (i1 >>  6) & 0x3f;
      uint32_t wb      = (i1 >>  0) & 0x3f;

      if (showfields) {
        printf("branch addr=0x%08x, unknown=%x, cond=%02d, pcrel=%x, addreg=%x, ra=%02d, X=%x, wa=%02d, wb=%02x\n",
          addr, unknown, cond, pcrel, addreg, ra, X, wa, wb);
      }

      // branch: b[link][cc] [linkreg,] [basedreg,]
      // addr is printed as a signed offset, hence the cast
      printf("%s%s %s; %s, %s%+d",
        pcrel ? "brr" : "bra", bcc[cond],
        qpu_w_add(wa, X), qpu_w_mul(wb, X),
        addreg ? qpu_r(ra, ra, 6, (i1 >> 28) & 0xf, 0) : "",
        (int32_t)addr);
      if (!addreg)
        printf(" // 0x%08x", base + addr + 8*4);
      printf("\n");
    }

Yes, I think so. Normally the add ALU writes to bank a and the mul ALU to bank b. The X (crossover) bit swaps the banks over.

So wa=4, wb=39, X=1 means rb4, - whereas wa=4, wb=39, X=0 would mean ra4, -

Now I've confused myself... I would have thought the branch instructions should be using the add ALU for the target address calculation and the mul ALU to mov the pc to the link register. That would mean wa is used to capture the target address and wb to capture the return address. But we are seeing wa used to capture the return address.

/HH
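
P.S. A quick way to sanity-check the field split against the brr word from the question (f0f81127). This is only a sketch using the same shifts as show_qpu_branch; the struct and function names (branch_fields, decode_branch, add_dest_bank) are hypothetical, and the bank rule is just the crossover behaviour as I currently understand it.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical container for the fields of the second branch word. */
struct branch_fields {
    uint32_t cond, pcrel, addreg, ra, X, wa, wb;
};

/* Split the second word of a branch using the shifts from show_qpu_branch. */
static struct branch_fields decode_branch(uint32_t i1)
{
    struct branch_fields f;
    assert(((i1 >> 28) & 0xf) == 0xf);  /* top nibble 1111 marks a branch */
    f.cond   = (i1 >> 20) & 0x0f;
    f.pcrel  = (i1 >> 19) & 0x01;
    f.addreg = (i1 >> 18) & 0x01;
    f.ra     = (i1 >> 13) & 0x1f;
    f.X      = (i1 >> 12) & 0x01;
    f.wa     = (i1 >>  6) & 0x3f;
    f.wb     = (i1 >>  0) & 0x3f;
    return f;
}

/* Bank written by the add ALU: 'a' normally, 'b' when X (crossover) is set. */
static char add_dest_bank(uint32_t X)
{
    return X ? 'b' : 'a';
}
```

Feeding in 0xf0f81127 gives pcrel=1, X=1, wa=4, wb=39, and add_dest_bank(1) is 'b', so the add-side destination prints as rb4, which agrees with the brr rb4 in the original listing.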