[haiku-commits] Re: BRANCH pdziepak-github.memcpy-v2 [c4c0758] src/system/kernel/arch/x86/64 src/system/libroot/posix/string/arch/x86_64 src/system/kernel/arch/x86 headers/private/kernel/arch/x86/64

  • From: Paweł Dziepak <pdziepak@xxxxxxxxxxx>
  • To: haiku-commits@xxxxxxxxxxxxx
  • Date: Wed, 10 Sep 2014 22:36:14 +0200

2014-09-10 22:31 GMT+02:00 pdziepak-github.memcpy-v2 <community@xxxxxxxxxxxx>:

> added 2 changesets to branch 'refs/remotes/pdziepak-github/memcpy-v2'
> old head: bb159c73fb99410fb591ee25b1fe0b92cd5870a5
> new head: c4c07587ea65bf0a7854ca91204324301b72e8df
> overview: https://github.com/pdziepak/Haiku/compare/bb159c7...c4c0758
>
>
> ----------------------------------------------------------------------------
>
> f4c3362: kernel/x86_64: save fpu state at interrupts
>
>   The kernel is allowed to use fpu anywhere so we must make sure that
>   user state is not clobbered by saving fpu state at interrupt entry.
>   There is no need to do that in case of system calls since all fpu
>   data registers are caller saved.
>
>   We do not need, though, to save the whole fpu state at task switch
>   (again, thanks to calling convention). Only status and control
>   registers are preserved. This patch actually adds the xmm0-15 registers
>   to the clobber list of the task switch code, but the only reason for
>   that is to make sure that nothing bad happens inside the function that
>   executes that task switch. Inspection of the generated code shows
>   that no xmm registers are actually saved.
>

I admit, this is a mess. Definitely, the next thing I am going to do for
Haiku is to rewrite the kernel entry/exit code.
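
To illustrate the clobber-list remark in the quoted commit message, here is
a minimal sketch, not the code from the patch. The names
pretend_context_switch() and scale_across_switch() are hypothetical, and the
asm body is deliberately empty: the clobber list only tells the compiler
that xmm0-15 may be overwritten, so it must not keep live values in them
across the statement.

```cpp
#include <cassert>

// Hypothetical stand-in for a function that wraps a context switch.
// The empty asm statement emits no instructions; the clobber list only
// tells the compiler that xmm0-xmm15 (and memory) may change across it.
static inline void
pretend_context_switch()
{
#if defined(__x86_64__)
    asm volatile("" : : :
        "xmm0", "xmm1", "xmm2", "xmm3", "xmm4", "xmm5", "xmm6", "xmm7",
        "xmm8", "xmm9", "xmm10", "xmm11", "xmm12", "xmm13", "xmm14",
        "xmm15", "memory");
#endif
}

// A floating-point value that is live across the asm statement cannot
// stay in an xmm register, so the compiler has to spill or recompute it.
static double
scale_across_switch(double value)
{
    double scaled = value * 2.0;
    pretend_context_switch();
    return scaled + 1.0;
}
```

If nothing floating-point is live at that point, the compiler has nothing to
spill, which would match the observation that no xmm registers end up being
saved in the generated code.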


>   Signed-off-by: Paweł Dziepak <pdziepak@xxxxxxxxxxx>
>
> c4c0758: libroot/x86_64: new memcpy implementation
>
>   This patch introduces a new memcpy() implementation that improves
>   performance when the buffer is small. It was written for processors that
>   support ERMSB, but performs reasonably well on older CPUs as well.
>
>   The following benchmarks were done on a Haswell i7 running Debian Jessie
>   with Linux 3.16.1. In each iteration a 64MB buffer was copied; the
>   parameter "size" is the size of the buffer passed in a single call (i.e.
>   for "size: 2" memcpy() was called ~32 million times to copy the whole
>   64MB).
>
>   f - original implementation, g - new implementation, all buffers 16 byte
>   aligned
>
>   cpy, size:        8, f:    79971 µs, g:    20419 µs, ∆:   74.47%
>   cpy, size:       32, f:    42068 µs, g:    12159 µs, ∆:   71.10%
>   cpy, size:      128, f:    13408 µs, g:    10359 µs, ∆:   22.74%
>   cpy, size:      512, f:    10634 µs, g:    10433 µs, ∆:    1.89%
>   cpy, size:     1024, f:    10474 µs, g:    10536 µs, ∆:   -0.59%
>   cpy, size:     4096, f:     9419 µs, g:     8630 µs, ∆:    8.38%
>
>   f - glibc 2.19 implementation, g - new implementation, all buffers 16 byte
>   aligned
>
>   cpy, size:        8, f:    26299 µs, g:    20919 µs, ∆:   20.46%
>   cpy, size:       32, f:    11146 µs, g:    12159 µs, ∆:   -9.09%
>   cpy, size:      128, f:    10778 µs, g:    10354 µs, ∆:    3.93%
>   cpy, size:      512, f:    12291 µs, g:    10426 µs, ∆:   15.17%
>   cpy, size:     1024, f:    13923 µs, g:    10571 µs, ∆:   24.08%
>   cpy, size:     4096, f:    11770 µs, g:     8671 µs, ∆:   26.33%
>
>   f - glibc 2.19 implementation, g - new implementation, all buffers
>   unaligned
>
>   cpy, size:       16, f:    13376 µs, g:    13009 µs, ∆:    2.74%
>   cpy, size:       32, f:    11130 µs, g:    12171 µs, ∆:   -9.35%
>   cpy, size:       64, f:    11017 µs, g:    11231 µs, ∆:   -1.94%
>   cpy, size:      128, f:    10884 µs, g:    10407 µs, ∆:    4.38%
>   cpy, size:      256, f:    10826 µs, g:    10106 µs, ∆:    6.65%
>   cpy, size:      512, f:    12354 µs, g:    10396 µs, ∆:   15.85%
>

If anyone is interested, this is the code used for the measurements (both
memcpy and memset): https://gist.github.com/pdziepak/2ae4e5ea88e8477d136a
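
For readers who only want the shape of the measurement, the following is a
rough sketch under stated assumptions and is not the code from the gist. The
helpers improvement_percent() and time_copy() are hypothetical names. The
formula (f - g) / f * 100 is inferred from the tables above, where it
reproduces the ∆ column (for example f = 79971, g = 20419 gives about
74.47).

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical helper: relative improvement of g over f in percent.
// Positive means g was faster than f.
static double
improvement_percent(double fMicros, double gMicros)
{
    return (fMicros - gMicros) / fMicros * 100.0;
}

// Hypothetical harness: copy `total` bytes in chunks of `size` bytes using
// `copy` and return the elapsed wall-clock time in microseconds.
template<typename CopyFunction>
static int64_t
time_copy(CopyFunction copy, size_t total, size_t size)
{
    assert(size > 0 && size <= total);

    std::vector<char> source(total, 1);
    std::vector<char> destination(total, 0);

    auto start = std::chrono::steady_clock::now();
    for (size_t offset = 0; offset + size <= total; offset += size)
        copy(destination.data() + offset, source.data() + offset, size);
    auto end = std::chrono::steady_clock::now();

    // Crude guard: read the result so the copies are not discarded.
    volatile char sink = destination[size - 1];
    (void)sink;

    return std::chrono::duration_cast<std::chrono::microseconds>(
        end - start).count();
}
```

Calling time_copy() once per implementation with total = 64MB and the same
"size", then passing both results to improvement_percent(), gives one row of
a table like the ones above. A real benchmark would also repeat the run and
control buffer alignment, which this sketch leaves out.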

>   Signed-off-by: Paweł Dziepak <pdziepak@xxxxxxxxxxx>
>
>                                     [ Paweł Dziepak <pdziepak@xxxxxxxxxxx> ]
>
>
> ----------------------------------------------------------------------------
>
> 11 files changed, 244 insertions(+), 67 deletions(-)
> headers/private/kernel/arch/x86/64/cpu.h         |  10 +-
> headers/private/kernel/arch/x86/64/iframe.h      |   1 +
> headers/private/kernel/arch/x86/arch_cpu.h       |  11 +-
> src/system/kernel/arch/x86/64/arch.S             |  29 -----
> src/system/kernel/arch/x86/64/interrupts.S       |  67 ++++++++--
> src/system/kernel/arch/x86/64/thread.cpp         |  31 ++---
> src/system/kernel/arch/x86/arch_cpu.cpp          |   5 +-
> src/system/kernel/arch/x86/arch_thread.cpp       |   4 +
> .../kernel/arch/x86/arch_user_debugger.cpp       |  23 +++-
> src/system/kernel/arch/x86/asm_offsets.cpp       |   1 +
> .../posix/string/arch/x86_64/arch_string.cpp     | 129 ++++++++++++++++++-
>
<snip, full diff>
