> l10:
>     st r0,(r1)++
>     addcmpblt r0,1,16,l10
> mov r0,0
> rts

Wonderful!

> Still, my first dhrystone benchmark runs are pretty abysmal compared to
> the ARM side. However, I hope this is due to cache issues and maybe it
> will get better when I get my data/bss moved out of the text section.

Here is some information from correspondence I had last year with eizo-san (when he was working on SIMD acceleration of graphics primitives):

Eizo-san wrote:
"
I've tried timing some instructions just for the lols. Data is not VPU L1 cached, but it is L2 cached. I'm super guessing the clock speed is 250 MHz. Here are my findings:

- the scalar unit in the VPU can dual issue, i.e. it can issue two independent instructions per clock
- the average scalar instruction takes one cycle to execute, and dependent instructions can go back-to-back with no trouble
- I have a feeling that a flag-setting instruction plus bcc takes two cycles for the bcc to happen; otherwise the bcc will overlap with other scalars, meaning the bcc takes just one cycle (this seems too good to be true... branch prediction?)
- vector instructions cannot dual issue with other vector instructions
- you can dual issue vector instructions with scalar instructions (vector+scalar, scalar+scalar, but not vector+vector)
- the average vector instruction takes one cycle, at 250 MHz
- vector multiplies always take 2 cycles
- instructions which do a >> 8, a sat, or a clamp seem to take 2 cycles
- back-to-back vector loads to the same address take 14 cycles; this appears to be the same regardless of the register target (and regardless of the size, although you seem to save a cycle or two going from a 64-byte load to 32 bytes or smaller)
- setting the increment flags on a non-incrementing instruction does not cost you anything
- using a 2x repeat is the same as issuing the same instruction twice
- oddly, using a 2x repeat on a mul (2-cycle) is barely slower than a single mul. Yet using 2x mul is significantly slower!
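One way to make these measurements concrete is a toy cycle-cost model. This is purely a sketch of the numbers reported above, not real hardware behaviour: the opcode names are my own invention, and real issue behaviour (dual issue with scalars, repeats, load pipelining) is more subtle than a simple sum.

```python
# Toy cycle-cost model for VPU vector instructions, based on the
# measurements quoted above. Illustrative only: the opcode names are
# made up, and real issue/dual-issue behaviour is far more complex.

# Measured per-instruction costs (cycles at ~250 MHz):
VECTOR_COST = {
    "vadd":     1,   # average vector op: 1 cycle
    "vmul":     2,   # vector multiplies: always 2 cycles
    "vaddsat":  2,   # ops with >>8 / sat / clamp: 2 cycles
    "vld_same": 14,  # back-to-back vector load to the same address
}

def sequence_cycles(ops):
    """Sum the per-instruction cost of a straight-line vector sequence.
    Vector instructions cannot dual issue with each other, so the costs
    simply add up (scalar ops issued alongside would be free)."""
    return sum(VECTOR_COST[op] for op in ops)

print(sequence_cycles(["vadd", "vmul", "vaddsat"]))  # 1 + 2 + 2 = 5
```

A model like this is only useful for back-of-envelope estimates of inner-loop throughput; the 14-cycle same-address load penalty in particular suggests real schedules need to interleave independent work.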
" The dual issue fits what I have observed as well with seeing the pc pushed during exception handling (or the pipeline is twice the length I think it is). Code is L1 cached. Data I think should be possible as L1 cached with the right allocation/addressing, but I think buffers tend to be in L2 (which is usually unified with ARM) for vector instruction and ARM synchronization. I think from memory a write from the vector memory unit, knows to mark any hit in L1 as invalid, but it writes direct into L2 or memory, without updating L1. ie the architecture is basically configured to avoid vector kernels polluting L1 (see the narrow/wide patents). Other random scribblings from previous emails: Herman wrote:" "Not really vector but: From one of the patents: [0083] Each of the dual issue ALU 334 and 344 may comprise suitable logic, circuitry, code, and/or interfaces that may be operable to perform superscalar execution, to issue two integer operations, and to issue an integer operation and a floating-point operation concurrently. Integer operations may be able to execute in a single cycle and a forwarding path may be provided such that the result can be used by the following instruction without incurring any stalls. Complex integer operations may be pipelined over two cycles, for example. In such instances, a single pipeline stall may be inserted if the following instruction references the result. Floating-point operations may be able to execute over three clock cycles, for example. These operations may be pipelined such that a floating-point operation may be issued at each clock cycle. However, a pipeline stall may be inserted if either of the two following instructions references the result. " Herman wrote: " Can confirm 250MHz (well we were running 19MHz at bootcode.bin launch - so much quicker under kernel.img once all PLLs etc have been set up by bootloaders). Agreed with dual issue scalar - separate integer and floating point pipes. 
I think I measured a depth of 5-7 stages for integer operations as well (depending on whether NOP is dual issued - it might be half this).

My hunch is that the vector unit has weak pipelining between instructions of certain types (there are some issues with the design that I can see - in the same way the barrel shifter in the ARM is a mess for efficient silicon implementations). So I would expect looped instructions to run faster than non-looped sequences, due to eliminating the decode stage and/or reducing contention on the scalar register bank.

What is still not clear (you'll laugh at this)... is the vector unit shared between the two scalar cores, or is there a vector unit per core (possibly all shared: register bank, vALU, etc.)? It's mostly irrelevant with the mailbox setup, more relevant for baremetal videocore.
"

Herman.