> l10:
>     st r0,(r1)++
>     addcmpblt r0,1,16,l10
> mov r0,0
> rts

Wonderful!

> Still, my first dhrystone benchmark runs are pretty abysmal compared to
> the ARM side. However, I hope this is due to cache issues and maybe it
> will get better when I get my data/bss moved out of the text section.

Here is some information from correspondence I had last year with eizo-san (when he was working on SIMD acceleration of graphics primitives):

Eizo-san wrote:
"
I've tried timing some instructions just for the lols. Data is not VPU L1 cached, but it is L2 cached. I'm super guessing the clock speed is 250 MHz. Here are my findings:

- the scalar unit in the VPU can dual issue, i.e. it can issue two independent instructions per clock
- the average scalar instruction takes one cycle to execute, and dependent instructions can go back-to-back with no trouble
- I have a feeling that a flag-setting instruction plus bcc takes two cycles for the bcc to happen; otherwise the bcc will overlap with other scalars, meaning the bcc takes just one cycle (this seems too good to be true... branch prediction?)
- vector instructions cannot dual issue with other vector instructions
- you can dual issue vector instructions with scalar instructions (vector+scalar, scalar+scalar, but not vector+vector)
- the average vector instruction takes one cycle, at 250 MHz
- vector multiplies always take 2 cycles
- instructions which do a >> 8, a sat, or a clamp seem to take 2 cycles
- back-to-back vector loads to the same address take 14 cycles; this appears to be the same regardless of the register target (and regardless of the size, although you seem to save a cycle or two going from a 64-byte load to 32 bytes or smaller)
- setting the increment flags on a non-incrementing instruction does not cost you anything
- using a 2x repeat is the same as issuing the same instruction twice
- oddly, using a 2x repeat on a mul (2-cycle) is barely slower than a single mul. Yet using 2x mul is significantly slower!
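One way to make these measurements concrete is a toy cycle-cost model. This is purely a sketch of the numbers reported above, not real hardware behaviour: the opcode names are my own invention, and real issue behaviour (dual issue with scalars, repeats, load pipelining) is more subtle than a simple sum.

```python
# Toy cycle-cost model for VPU vector instructions, based on the
# measurements quoted above. Illustrative only: the opcode names are
# made up, and real issue/dual-issue behaviour is far more complex.

# Measured per-instruction costs (cycles at ~250 MHz):
VECTOR_COST = {
    "vadd":     1,   # average vector op: 1 cycle
    "vmul":     2,   # vector multiplies: always 2 cycles
    "vaddsat":  2,   # ops with >>8 / sat / clamp: 2 cycles
    "vld_same": 14,  # back-to-back vector load to the same address
}

def sequence_cycles(ops):
    """Sum the per-instruction cost of a straight-line vector sequence.
    Vector instructions cannot dual issue with each other, so the costs
    simply add up (scalar ops issued alongside would be free)."""
    return sum(VECTOR_COST[op] for op in ops)

print(sequence_cycles(["vadd", "vmul", "vaddsat"]))  # 1 + 2 + 2 = 5
```

A model like this is only useful for back-of-envelope estimates of inner-loop throughput; the 14-cycle same-address load penalty in particular suggests real schedules need to interleave independent work.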
" The dual issue fits what I have observed as well with seeing the pc pushed during exception handling (or the pipeline is twice the length I think it is). Code is L1 cached. Data I think should be possible as L1 cached with the right allocation/addressing, but I think buffers tend to be in L2 (which is usually unified with ARM) for vector instruction and ARM synchronization. I think from memory a write from the vector memory unit, knows to mark any hit in L1 as invalid, but it writes direct into L2 or memory, without updating L1. ie the architecture is basically configured to avoid vector kernels polluting L1 (see the narrow/wide patents). Other random scribblings from previous emails: Herman wrote:" "Not really vector but: From one of the patents: [0083] Each of the dual issue ALU 334 and 344 may comprise suitable logic, circuitry, code, and/or interfaces that may be operable to perform superscalar execution, to issue two integer operations, and to issue an integer operation and a floating-point operation concurrently. Integer operations may be able to execute in a single cycle and a forwarding path may be provided such that the result can be used by the following instruction without incurring any stalls. Complex integer operations may be pipelined over two cycles, for example. In such instances, a single pipeline stall may be inserted if the following instruction references the result. Floating-point operations may be able to execute over three clock cycles, for example. These operations may be pipelined such that a floating-point operation may be issued at each clock cycle. However, a pipeline stall may be inserted if either of the two following instructions references the result. " Herman wrote: " Can confirm 250MHz (well we were running 19MHz at bootcode.bin launch - so much quicker under kernel.img once all PLLs etc have been set up by bootloaders). Agreed with dual issue scalar - separate integer and floating point pipes. 
I think I measured a depth of 5-7 stages for integer operations as well (depending on whether NOP is dual issued - it might be half this).

My hunch is that the vector unit has weak pipelining between instructions of certain types (there are some issues with the design that I can see - in the same way the barrel shifter in the ARM is a mess for efficient silicon implementations). So I would expect looped instructions to run faster than non-looped sequences, due to eliminating the decode stage and/or reducing contention on the scalar register bank.

What is still not clear (you'll laugh at this)... is the vector unit shared between the two scalar cores, or is there a vector unit per core (possibly all shared: register bank, vALU, etc.)? It's mostly irrelevant with the mailbox setup, more relevant for baremetal videocore.
"

Herman.