Re: Memory operations on Sun/Oracle M class servers vs T class servers

  • From: Tanel Poder <tanel@xxxxxxxxxxxxxx>
  • To: Mark Burgess <mark@xxxxxxxxxxxxxxxxxxxxxxxxx>
  • Date: Tue, 16 Dec 2014 22:49:22 -0400

The CPU die has limited "real estate" for stuff.

Each CPU core takes up space (and so does cache). The more complex and
sophisticated the CPU core microarchitecture is, the more space it takes.

The design tradeoff between the classic CPU architecture and T-series CMT
architecture was to reduce the CPU core complexity so you'd be able to put
more cores on the chip.

A less sophisticated microarchitecture means there's less (or no)
prefetching, pipelining, out-of-order execution, branch prediction and
instruction-level parallelism going on inside the CPU - so you just spend
more CPU cycles per instruction (stalling for memory access and other
stuff) when executing code. That's why the single-threaded performance
sucked. This was somewhat compensated for by having many sets of registers
(virtual CPU threads) built into the same core - so when one thread was
waiting (stalled), another thread's instructions could be scheduled on
the same core's execution units. With lots of threads you can get decent
performance out of the modern T-series CPUs. Note that vmstat and other
OS-level tools are useless for CPU capacity planning on these platforms:
they count each hardware thread as a CPU, but how much a single thread
gets done depends on how busy the core it shares is.
Corestat is the utility for measuring the actual core "busyness" on Solaris.
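
To make the vmstat point concrete, here is a toy model in Python. It is only
a sketch under loud assumptions that are mine, not measurements: the core can
issue for one hardware thread per cycle, stalls interleave perfectly, and the
numbers (8 hardware threads per core, a thread busy 25% of the time) are
hypothetical.

```python
# Toy model of a CMT core. Assumptions (not from any real T-series data):
# the core issues for one hardware thread per cycle and stalls interleave
# perfectly, so waiting threads never block runnable ones.

def os_view_utilization(active_threads, hw_threads):
    """What a vmstat-style tool sees: each hardware thread counts as a CPU."""
    return active_threads / hw_threads

def core_utilization(active_threads, busy_fraction):
    """Fraction of cycles the core's execution units are actually busy."""
    return min(1.0, active_threads * busy_fraction)

def per_thread_speed(active_threads, busy_fraction):
    """Throughput of one thread relative to running alone on the core."""
    demand = active_threads * busy_fraction
    return 1.0 if demand <= 1.0 else 1.0 / demand

if __name__ == "__main__":
    HW_THREADS = 8        # hypothetical: 8 hardware threads per core
    BUSY_FRACTION = 0.25  # hypothetical: a thread stalls 75% of the time
    for n in (1, 2, 4, 8):
        print(n,
              os_view_utilization(n, HW_THREADS),
              core_utilization(n, BUSY_FRACTION),
              per_thread_speed(n, BUSY_FRACTION))
```

In this model, with 4 of the 8 hardware threads running, the OS view shows
50% utilization while the core is already saturated, and at 8 threads each
one runs at half its solo speed. Real cores are messier, which is why you
measure with corestat instead of guessing.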


Of course the issue may be somewhere else entirely (e.g. parameter or
environment differences, bugs, or the number of DIMM slots you have
actually filled in the server - more is better :)

The Solaris tools cputrack and cpustat (available on SPARC since Solaris 8!)
or DTrace's cpc provider (Solaris 11) would let you drill down into the
wonderful world of CPU performance counters, which help to break down where
your CPU cycles get used or wasted. This is especially relevant in the
in-memory days.
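
As a rough illustration of what you do with such counters, the sketch below
turns a cycle count and an instruction count into cycles per instruction
(CPI). The event names and how you collect them differ per CPU model, so the
function just takes two totals as plain numbers. The function names and the
ideal_cpi baseline of 1.0 are my assumptions for illustration, not anything
the tools report.

```python
# Sketch: derive CPI from two counter totals gathered over the same
# interval. The inputs are hypothetical; collect real ones with your
# platform's counter tools.

def cycles_per_instruction(cycles, instructions):
    """Average CPU cycles spent per retired instruction."""
    if cycles <= 0 or instructions <= 0:
        raise ValueError("cycles and instructions must be positive")
    return cycles / instructions

def wasted_cycle_share(cycles, instructions, ideal_cpi=1.0):
    """Share of cycles beyond what the assumed ideal CPI would need."""
    cpi = cycles_per_instruction(cycles, instructions)
    return max(0.0, 1.0 - ideal_cpi / cpi)
```

For example, 4,000,000 cycles for 1,000,000 instructions is a CPI of 4.0,
which under the assumed baseline means three quarters of the cycles went to
stalls rather than useful work.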

Tanel.

On Tue, Dec 16, 2014 at 6:47 PM, Mark Burgess <
mark@xxxxxxxxxxxxxxxxxxxxxxxxx> wrote:
>
> Finn,
>
> I have been seeking answers to the same types of questions for a customer
> site for the past 3 years on T4 and T3 hardware. Others will be able to
> offer a far more scientific explanation as to why but in a nutshell the T
> series platform seems to be good for doing lots of things concurrently as
> opposed to one thing particularly fast. The classic response time vs
> throughput discussion. The types of problems you describe below are exactly
> the types of issues I have encountered - single-threaded processes, i.e.
> SQL/PLSQL, take longer to run than you would expect.
> Unfortunately I have not been in a position to be able to perform a like
> for like comparison against another platform to provide some science behind
> the analysis. I have used parallel query selectively to resolve
> single-threaded performance issues; however, I do not see this as a viable
> approach to work around all the performance constraints on this platform.
>
> I have been looking to set up SLOB to compare T4, X4-2L and Exa X4-2
> timings; however, this is still on the to-do list. I believe this is the
> only way to get a comparative measure of the T series against other
> platforms.
>
> Regards,
>
> Mark
>
>
>
>
