[nanomsg] Re: [Non-DoD Source] Re: Planning nanomsg on multithreaded and distributed applications

From: Garrett D'Amore <garrett@xxxxxxxxxx>
To: nanomsg@xxxxxxxxxxxxx
Date: Wed, 11 Jan 2017 13:30:27 -0800

I am using a subset of C99. C11 is barely present or entirely absent on many
platforms, and even then the atomics are, as you noticed, optional.

Sent from my iPhone

On Jan 11, 2017, at 1:14 PM, Karan, Cem F CIV USARMY RDECOM ARL (US)
<cem.f.karan.civ@xxxxxxxx> wrote:

Which flavor of C are you targeting?  <stdatomic.h> is a part of the C11
standard, which may help out (barring a compiler that defines
__STDC_NO_ATOMICS__).
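
A minimal sketch of the feature test mentioned above, assuming a GCC- or Clang-compatible compiler for the fallback branch. The names refcnt_t, refcnt_init, refcnt_dec and HAVE_C11_ATOMICS are hypothetical and do not come from nanomsg:

```c
#include <assert.h>

/* Use <stdatomic.h> only when the compiler claims C11 and does not define
 * __STDC_NO_ATOMICS__; otherwise fall back to the GCC/Clang __sync builtins
 * (an assumption: other compilers would need their own branch). */
#if defined(__STDC_VERSION__) && (__STDC_VERSION__ >= 201112L) && \
    !defined(__STDC_NO_ATOMICS__)
#include <stdatomic.h>
#define HAVE_C11_ATOMICS 1
typedef atomic_int refcnt_t;

static void refcnt_init(refcnt_t *rc, int value)
{
    atomic_init(rc, value);
}

/* atomic_fetch_sub returns the OLD value, so subtract 1 for the new one. */
static int refcnt_dec(refcnt_t *rc)
{
    int now = atomic_fetch_sub(rc, 1) - 1;
    assert(now >= 0); /* going below zero means an unbalanced release */
    return now;
}
#else
#define HAVE_C11_ATOMICS 0
typedef volatile int refcnt_t;

static void refcnt_init(refcnt_t *rc, int value)
{
    *rc = value;
}

/* __sync_sub_and_fetch already returns the NEW value. */
static int refcnt_dec(refcnt_t *rc)
{
    int now = __sync_sub_and_fetch(rc, 1);
    assert(now >= 0);
    return now;
}
#endif
```

Both branches expose the same two functions, so calling code does not need to know which one was selected.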

Thanks,
Cem Karan

-----Original Message-----
From: nanomsg-bounce@xxxxxxxxxxxxx [mailto:nanomsg-bounce@xxxxxxxxxxxxx] On
Behalf Of Garrett D'Amore
Sent: Wednesday, January 11, 2017 2:32 PM
To: nanomsg@xxxxxxxxxxxxx
Subject: [Non-DoD Source] [nanomsg] Re: Planning nanomsg on multithreaded
and distributed applications

Thanks.  I am aware of this and may go back and swap some of these locks for
atomics (particularly where the thing protected is a reference count), but
note that atomics are not portable, which is why for now I have stuck with
pthreads. I haven't seen a cause for concern with respect to performance yet,
but will look at this later once the functionality is complete.

You will note that the reference counting code is currently NOT on any hot
code paths.  (There is a check for initialization that IS on a hot code path,
but that is already designed to avoid locking altogether in the typical case.)

Modern thread implementations use an adaptive scheme where the mutex spins
first and only degenerates to a sleeping mutex if it is held by a thread that
is not currently running on a CPU. In my experience I have never found it
necessary to explicitly demand a spin lock.
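
To illustrate the spin-then-sleep idea described above, here is a rough user-level approximation built on pthread_mutex_trylock. It is only a sketch: a real adaptive mutex also checks whether the lock owner is currently on a CPU, which this code does not do. The function adaptive_lock and the bound SPIN_TRIES are hypothetical:

```c
#include <assert.h>
#include <pthread.h>

#define SPIN_TRIES 100 /* arbitrary illustrative bound, not a tuned value */

/* Try to take the mutex without sleeping a bounded number of times, then
 * fall back to a normal blocking lock. */
static void adaptive_lock(pthread_mutex_t *mx)
{
    int i;
    int rv;

    for (i = 0; i < SPIN_TRIES; i++) {
        if (pthread_mutex_trylock(mx) == 0) {
            return; /* acquired while spinning */
        }
    }
    rv = pthread_mutex_lock(mx); /* may sleep until the owner unlocks */
    assert(rv == 0);
    (void) rv; /* silence unused warning when NDEBUG is defined */
}
```

The mutex is released with the ordinary pthread_mutex_unlock in either case.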

Sent from my iPhone

On Jan 11, 2017, at 10:03 AM, Roberto Fichera <kernel@xxxxxxxxxxxxx> wrote:

On 01/04/2017 10:39 PM, Garrett D'Amore wrote:
It actually typically uses 2 threads per socket, and 2 threads per
underlying connection.

This could be altered to use a co-routine library, and it's designed to
support that as part of platform porting.
Having said that, I’m *strongly* of the opinion that, given a robust
and non-crappy threads implementation, threads will perform and scale
quite well.  Most of the complaints about thread scalability come from
three areas:

a) Poor application/library design leading to lots of lock
contention.  Uncontended mutexes are cheap.  Contended ones are not.  I’ve
designed to minimize contention.

b) Stack consumption.  Generally threads *do* each have their own
stack.  I’ve taken care to keep my stacks shallow, so it’s unlikely
that we would ever need more than a single page per thread.  At 4K page
sizes, this means that you can have 1000 threads (around 500 connections)
in only 4MB RAM.  If you need 1M connections, you might feel the problem.

You’ll run out of TCP ports first though.

c) Crappy threading implementations.  This is largely a thing of the
past.  Modern thread libraries are quite performant and scale particularly
well.
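
The stack arithmetic in (b) can be written down as a small helper. This is only a back-of-the-envelope estimate under the figures quoted above (2 threads per connection, one 4K page per thread); it ignores the per-socket threads, guard pages, and any minimum stack size the platform enforces. The names stack_bytes and THREADS_PER_CONNECTION are hypothetical:

```c
#include <assert.h>
#include <stddef.h>

#define THREADS_PER_CONNECTION 2 /* figure quoted in the message above */

/* Estimated total stack memory, in bytes, for a number of connections
 * when every thread gets stack_size bytes of stack. */
static size_t stack_bytes(size_t connections, size_t stack_size)
{
    assert(stack_size != 0);
    return connections * THREADS_PER_CONNECTION * stack_size;
}
```

With 500 connections and 4096-byte stacks this gives 4,096,000 bytes, the "4MB" above. By the same estimate, 1M connections would need roughly 8 GB of stack.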

Conversely, threading leads to inherently better multi-core
scalability, giving huge performance wins on larger systems.  And, as a
particularly nice bonus, the logic flow in single threads is *lots*
easier to understand.

Anyway, so that’s my thinking.  Once I’ve gotten a little further we
will be able to test these theories with actual scalability and
performance tests.  If it turns out that I’ve misjudged, it will still be
possible to retrofit some kind of coroutine API.

One thing about pthread mutexes and all synchronization primitives: even
when they are not contended, they are still a preemption point, so the
scheduler can decide to move the thread to the wait queue, impacting
performance. Certain patterns like:

  nni_mtx_lock(&pair->mx);
  pair->refcnt--;
  if (pair->refcnt == 0) {
      nni_mtx_unlock(&pair->mx);
      nni_inproc_pair_destroy(pair);
  } else {
      nni_mtx_unlock(&pair->mx);
  }

can be replaced by atomic operations, where you pay only for a bus
transaction and a cache line lock. gcc, for example, offers something like
the following:

#define atomic_sub(__var, __value) __sync_sub_and_fetch( __var, __value )
#define atomic_get( __var ) ( __sync_synchronize(), *__var )

  if (atomic_sub( &pair->refcnt, 1 ) == 0) {
      nni_inproc_pair_destroy(pair);
  }

  if (atomic_get( &pair->refcnt ) == 0) {
      nni_inproc_pair_destroy(pair);
  }

Windows can be implemented easily as well:

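#define atomic_get( __var ) ( _ReadWriteBarrier(), *__var )
/* _InterlockedExchangeAdd returns the value held BEFORE the add, so subtract
   again to get the new value, matching __sync_sub_and_fetch above. */
#define atomic_sub( __var, __value ) \
    ( _InterlockedExchangeAdd( __var, -(__value) ) - (__value) )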
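
As a self-contained illustration of the atomic reference-count pattern discussed above, here is a sketch using the GCC/Clang __sync builtins. The struct and the functions pair_create, pair_hold, pair_release and pair_destroy are hypothetical stand-ins, not the nanomsg code. Note that the destroy decision is taken from the value returned by the decrement itself; reading the counter again in a separate step would, as far as I can tell, allow two threads to both observe zero.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for the inproc pair; not the real nanomsg struct. */
struct pair {
    int refcnt;
};

static int destroyed_count; /* counts destroy calls, for demonstration only */

static void pair_destroy(struct pair *p)
{
    destroyed_count++;
    free(p);
}

/* Create a pair holding one reference for the caller. */
static struct pair *pair_create(void)
{
    struct pair *p = malloc(sizeof(*p));
    assert(p != NULL);
    p->refcnt = 1;
    return p;
}

/* Take an additional reference. */
static void pair_hold(struct pair *p)
{
    __sync_add_and_fetch(&p->refcnt, 1);
}

/* Drop a reference; only the caller that sees the count reach zero
 * destroys the object, so no mutex is needed around the decrement. */
static void pair_release(struct pair *p)
{
    if (__sync_sub_and_fetch(&p->refcnt, 1) == 0) {
        pair_destroy(p);
    }
}
```

A caller would pair every pair_create or pair_hold with exactly one pair_release.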

Quite often, for short lock/unlock pairs it is better to use spinlocks
instead of a normal mutex, to avoid sleeping altogether.
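
For reference, a spinlock of the kind suggested above can be sketched with the GCC/Clang __sync builtins. This is a minimal test-and-set lock with no backoff or fairness, shown only to make the idea concrete; the names demo_spinlock_t, spin_lock, spin_trylock and spin_unlock are hypothetical:

```c
#include <assert.h>

typedef volatile int demo_spinlock_t; /* 0 = unlocked, 1 = locked */

/* Busy-wait until the lock is acquired; never sleeps. */
static void spin_lock(demo_spinlock_t *l)
{
    while (__sync_lock_test_and_set(l, 1)) {
        while (*l) {
            /* spin on a plain read until the lock looks free */
        }
    }
}

/* Single attempt; returns 1 on success, 0 if the lock was already held. */
static int spin_trylock(demo_spinlock_t *l)
{
    return __sync_lock_test_and_set(l, 1) == 0;
}

/* Release the lock (writes 0 with release semantics). */
static void spin_unlock(demo_spinlock_t *l)
{
    assert(*l == 1); /* unlocking a free lock is a caller bug */
    __sync_lock_release(l);
}
```

Whether this beats an adaptive mutex depends on how long the lock is held and how many threads contend for it, so it would need to be measured.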
