[nanomsg] Re: Planning nanomsg on multithreaded and distributed applications

From: "Garrett D'Amore" <garrett@xxxxxxxxxx>
To: "nanomsg@xxxxxxxxxxxxx" <nanomsg@xxxxxxxxxxxxx>
Date: Tue, 17 Jan 2017 11:15:52 -0800

So what you’re pointing out is that uncontended mutexes are slightly more
than twice as expensive as uncontended spinlocks.  This isn’t particularly
surprising to me — we’re talking about really really small times, and the
extra complexity of a branch inherent in the mutex could easily account for
the difference.  We’re talking about the difference between 9 ns and 20
ns.  In the grand scheme of things, these are *nothing*.  The
implementation you’re using looks like a pretty good one in fact.  I
guarantee that the 11 ns difference is *not* the result of going off to do
context switching.  It probably *is* the result of a branch that checks
whether it is appropriate to spin or not.

Conversely, a spinlock can be tragically bad if you have a situation where
the lock is under contention, since then your threads spin on CPU,
preventing other useful work from being done.  If you’re using these
spinlocks to protect a reference count (instead of atomics for example),
this looks like a good solution.  But if you are holding the locks across
longer code sections, it could be tragic.

Ultimately, if your code performance is tied to the performance of mutexes,
you’ve got a bad misdesign.  The locks should be uncontended most of the
time, and the time spent locking should be small.  I think most uses of
spinlocks are actually bad signs of premature optimization.

I’m happy to pay an extra 11 ns for a locking implementation that will work
reliably in all circumstances.  Others are welcome to disagree.  I’d rather
focus on larger things that will give me back micro- or milli-seconds as a
first order of business.  If it later turns out that profiling shows that
there is some non-negligible benefit to converting some of these locks to
spinlocks, then I’ll be happy to do so at that time.

- Garrett

On Tue, Jan 17, 2017 at 10:55 AM, Roberto Fichera <kernel@xxxxxxxxxxxxx>
wrote:

On 01/11/2017 08:32 PM, Garrett D'Amore wrote:

Thanks.  I am aware of this and may go back and swap some of these locks

for atomics (particularly where the thing protected is a reference count)
but note that atomics are not portable which is why for now I have stuck
with pthreads. I haven't seen a cause for concern with respect to
performance yet but will look at this later once the functionality is
complete.

You will not that the reference counting stuff in the code currently is

NOT on any hot code paths.  (There is a check for initialization that IS on
a hot code path but that is already designed to avoid locking altogether in
the typical case.)

Modern thread implementations use an adaptive scheme where the mutex is

a spin first and only degenerates to a sleeping mutex if the mutex is
locked by a thread that is not currently running on cpu. In my experience I
have never found it necessary to explicitly demand a spin lock.

Ok for lock contention in slow path, but recent mutex implementation are
still having
overhead. If you run the test program below, to show only lock/unlock
overhead, you
can clearly see that there is lots of difference between mutex and
spinlock implementations
even on modern implementation (glibc v2.24.4):

pthread_mutex iterations 100000, elapse time 1989837 nanosecs
pthread_spinlock iterations 100000, elapse time 878622 nanosecs

pthread_mutex iterations 100000, elapse time 2027939 nanosecs
pthread_spinlock iterations 100000, elapse time 891419 nanosecs

pthread_mutex iterations 100000, elapse time 2228180 nanosecs
pthread_spinlock iterations 100000, elapse time 1050840 nanosecs

pthread_mutex iterations 100000, elapse time 2367181 nanosecs
pthread_spinlock iterations 100000, elapse time 892797 nanosecs

#include <stdio.h>
#include <pthread.h>
#include <time.h>

static struct timespec timer_start()
{
    struct timespec start_time;

    clock_gettime( CLOCK_PROCESS_CPUTIME_ID, &start_time );

    return start_time;
}

static long timer_end( struct timespec *start_time )
{
    struct timespec end_time;

    clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &end_time);

    return end_time.tv_nsec - start_time->tv_nsec;
}

#define ITERATIONS  100000

void main()
{
    int count;
    long elapsed;
    pthread_mutex_t mutex;
    pthread_spinlock_t spinlock;
    struct timespec vartime;

    pthread_mutex_init( &mutex, NULL );

    count = ITERATIONS;
    vartime = timer_start();

    while( count-- )
    {
        pthread_mutex_lock( &mutex );
        pthread_mutex_unlock( &mutex );
    }

    elapsed = timer_end( &vartime );
    printf( "pthread_mutex iterations %d, elapse time %ld nanosecs\n",
ITERATIONS, elapsed );

    pthread_mutex_destroy( &mutex );

    pthread_spin_init( &spinlock, 0 );

    count = ITERATIONS;
    vartime = timer_start();

    while( count-- )
    {
        pthread_spin_lock( &spinlock );
        pthread_spin_unlock( &spinlock );
    }

    elapsed = timer_end( &vartime );
    printf( "pthread_spinlock iterations %d, elapse time %ld nanosecs\n",
ITERATIONS, elapsed );

    pthread_spin_destroy( &spinlock );
}

Sent from my iPhone

On Jan 11, 2017, at 10:03 AM, Roberto Fichera <kernel@xxxxxxxxxxxxx>

wrote:

On 01/04/2017 10:39 PM, Garrett D'Amore wrote:
It actually typically uses 2 threads per socket, and 2 threads per

underlying connection.

This could be altered to use a co-routine library, and its designed to

support that as part of platform porting.

Having said that, I’m *strongly* of the opinion that given a robust

and non-crappy threads implementation, that

threads will perform and scale quite highly.  Most of the complaints

about thread scalability come from three areas:

a) Poor application/library design leading to lots of lock

contention.  Uncontended mutexes are cheap.  Contended ones

are not.  I’ve designed to minimize contention.

b) Stack consumption.  Generally threads *do* each have their own

stack.  I’ve taken care to keep my stacks shallow,

so its unlikely that we would ever need more than a single page per

thread.  At 4K page sizes, this means that you can

have 1000 threads (around 500 connections) in only 4MB RAM.  If you

need 1M connections, you might feel the problem.

You’ll run out of TCP ports first though.

c) Crappy threading implementations.  This is largely a thing of the

past.  Modern thread libraries are quite

performant and scale particularly well.

Conversely, threading leads to inherently better multi-core

scalability, giving huge performance wins on larger

systems.  And, as a particularly nice bonus, the logic flow in single

threads is *lots* easier to understand.

Anyway, so that’s my thinking.  Once I’ve gotten a little further we

will be able to test these theories with actual

scalability and performance tests.  If it turns out that I’ve

misjudged, it will still be possible to retrofit some

kind of coroutine API.

One of the thing regarding pthread mutex and all synchronization

primitives, even if there is not contended,

they are still a preemption point, so the scheduler can decide to move

the thread in the waitqueue, impacting

the performance. Certain pattern like:

   nni_mtx_lock(&pair->mx);
   pair->refcnt--;
   if (pair->refcnt == 0) {
       nni_mtx_unlock(&pair->mx);
       nni_inproc_pair_destroy(pair);
   } else {
       nni_mtx_unlock(&pair->mx);
   }

can be replaced by atomic operations where you pay only bus transaction

and cache line lock, gcc for example offers,

something like below:

#define atomic_sub(__var, __value) __sync_sub_and_fetch( __var, __value

)

#define atomic_get( __var ) ( __sync_synchronize(), *__var )

   if (atomic_sub( &pair->refcnt, 1 ) == 0) {
       nni_inproc_pair_destroy(pair);
   }

   if (atomic_get( &pair->refcnt ) == 0) {
       nni_inproc_pair_destroy(pair);
   }

windows can be implemented easily as well

#define atomic_get( __var ) ( _ReadWriteBarrier(), *__var )
#define atomic_sub( __var, __value ) _InterlockedExchangeAdd( __var,

-__value )

Quite often for shorter lock&unlock pair is better to use spinlocks

instead of normal mutex to avoid

totally to sleep

Follow-Ups:
- [nanomsg] Re: Planning nanomsg on multithreaded and distributed applications
  - From: Michael Powell
- [nanomsg] Re: Planning nanomsg on multithreaded and distributed applications
  - From: Roberto Fichera

References:
- [nanomsg] Planning nanomsg on multithreaded and distributed applications
  - From: Roberto Fichera
- [nanomsg] Re: Planning nanomsg on multithreaded and distributed applications
  - From: Garrett D'Amore
- [nanomsg] Re: Planning nanomsg on multithreaded and distributed applications
  - From: Roberto Fichera
- [nanomsg] Re: Planning nanomsg on multithreaded and distributed applications
  - From: Garrett D'Amore
- [nanomsg] Re: Planning nanomsg on multithreaded and distributed applications
  - From: James Root
- [nanomsg] Re: Planning nanomsg on multithreaded and distributed applications
  - From: Garrett D'Amore
- [nanomsg] Re: Planning nanomsg on multithreaded and distributed applications
  - From: Roberto Fichera
- [nanomsg] Re: Planning nanomsg on multithreaded and distributed applications
  - From: Garrett D'Amore
- [nanomsg] Re: Planning nanomsg on multithreaded and distributed applications
  - From: Roberto Fichera

[nanomsg] Re: Planning nanomsg on multithreaded and distributed applications

Other related posts: