[haiku-commits] Re: haiku: hrev44282 - src/system/kernel/arch/x86

  • From: "Ingo Weinhold" <ingo_weinhold@xxxxxx>
  • To: haiku-commits@xxxxxxxxxxxxx
  • Date: Tue, 03 Jul 2012 01:57:09 +0200

Rene Gollent wrote:
> On Mon, Jul 2, 2012 at 5:59 PM, Ingo Weinhold <ingo_weinhold@xxxxxx> wrote:
> > Not that performance matters in this implementation anyway, but the cmp can 
> > be omitted. dec already updates ZF.
> 
> Good to know, thanks, like I said x86 isn't one I've had much past
> experience with.
> 
> > 4 instructions per byte is rather suboptimal (reordering movsb and cmpb 
> > might be slightly better).
> 
> Fair enough on the reorder, but since the rep / mov variation is
> apparently out of the question, I'm not really sure as to a more
> compact way to write it. Thoughts?

I can only think of moving multiple bytes at once.
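For illustration only (this is not from the commit under discussion), a common way to move multiple bytes at once in a string copy is the classic bit trick that detects a zero byte anywhere in a 32-bit word, so the loop can test four bytes per iteration:

```c
#include <stdint.h>

/* Returns nonzero iff any byte of the 32-bit word v is zero.
   Subtracting 0x01 from each byte borrows into that byte's high bit
   exactly when the byte was zero; masking with ~v rules out bytes
   that already had their high bit set. */
static int has_zero_byte(uint32_t v)
{
	return (v - 0x01010101UL) & ~v & 0x80808080UL;
}
```

A word-at-a-time strlcpy would read a dword, check it with this test, and fall back to byte copies only for the word containing the terminator (and for unaligned head/tail bytes).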

> > A problem I notice now (and which the previous implementation had, too) is 
> > that the source memory location is accessed twice. If that is user memory 
> > it could actually change in the meantime, thus possibly resulting in an 
> > unterminated destination buffer. A safe (though even slower) implementation 
> > would be:
> 
> Wouldn't that imply that the source program is doing something not
> thread-safe? I didn't think strlcpy and friends guaranteed thread
> safety, especially not concurrent modification of the string being
> copied.

Indeed, that wouldn't be thread-safe, and strlcpy() and friends aren't supposed 
to be thread-safe either. But that isn't the point here. We want to keep the 
kernel safe from anything a userland application may do (a security issue, if 
you want). The contract for a user_strlcpy() from user memory to kernel memory 
is that it either returns an error or results in a properly null-terminated 
buffer. Whether the buffer contains a mix of two versions because the source 
changed concurrently doesn't matter -- as long as it is properly 
null-terminated.
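In C terms, the safe variant reads each source byte exactly once into a local, then both the termination test and the store use that same value, so a concurrent change to the source can never leave the destination unterminated. A minimal sketch (hypothetical, not Haiku's actual user_strlcpy(), and with simplified return semantics):

```c
#include <stddef.h>

/* Bounded copy that reads each source byte exactly once.  Even if
   *src is modified concurrently, dest is always null-terminated.
   Returns the number of bytes copied, excluding the terminator
   (real strlcpy() would return the full source length instead). */
static size_t copy_bounded(char *dest, const char *src, size_t size)
{
	size_t i = 0;
	if (size > 0) {
		for (; i < size - 1; i++) {
			char c = src[i];	/* single read of the source byte */
			dest[i] = c;		/* store the value we tested */
			if (c == '\0')
				return i;
		}
		dest[i] = '\0';			/* truncated: terminate explicitly */
	}
	return i;
}
```

The unsafe pattern, by contrast, is `cmpb` on the source followed by a separate `movsb` from the same address: between the two accesses a userland thread can overwrite the null byte.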

[...]
> Thanks! Will revise shortly. While we're on the subject, do you think
> it'd be worth reworking arch_cpu_user_memcpy() and
> arch_cpu_user_memset() to operate a dword at a time for as much of the
> data as possible and then copying the remaining 1-3 bytes if needed?

Performance-wise it will very likely be worth it. I believe the fastest 
implementations use such an algorithm. Processor family/model-specific versions 
(using MMX, SSE, and friends) might be even better. Note that we already have a 
mechanism for optimized memcpy() and other functions provided by the CPU module 
(IIRC, there are no actual implementations yet, though). The user_*() versions 
could be added as well.
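The dword-at-a-time scheme Rene describes can be sketched in C as follows (an illustration of the idea only, not the arch_cpu_user_memcpy() implementation; alignment of src/dest is ignored, which x86 tolerates but other architectures may not):

```c
#include <stddef.h>
#include <stdint.h>

/* Copy len bytes: 32-bit words while at least 4 bytes remain,
   then the trailing 1-3 bytes individually. */
static void *copy_dwords(void *dest, const void *src, size_t len)
{
	uint32_t *d = dest;
	const uint32_t *s = src;

	while (len >= 4) {		/* bulk of the data, 4 bytes per step */
		*d++ = *s++;
		len -= 4;
	}

	unsigned char *db = (unsigned char *)d;
	const unsigned char *sb = (const unsigned char *)s;
	while (len-- > 0)		/* remaining 0-3 tail bytes */
		*db++ = *sb++;

	return dest;
}
```

A real kernel version would additionally handle faults on the user address and possibly align the destination first so the wide stores are aligned.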

CU, Ingo
