[nanomsg] Re: The Secret To 10 Million Concurrent Connections....

From: "Garrett D'Amore" <garrett@xxxxxxxxxx>
To: nanomsg@xxxxxxxxxxxxx
Date: Tue, 05 Dec 2017 00:13:42 +0000

This article gets a lot right, but it gets a *lot* wrong too.  It
represents the thinking that TCP stacks should be in userland, as
epitomized by SolarFlare based designs.

The problem with this is that you *still* have a kernel, and the kernel is
responsible for doing a lot of things, not just networking — *and you need
it to*.   It also assumes that you have a workload where it makes *sense*
to allocate some fixed number of CPU cores to your networking stack, and
that you have only one or two NICs (or NIC rings) and one or two networking
applications on the same host.

This approach is all wrong for real world general purpose servers.  It can
work for dedicated servers.

In the real world, you often have many applications, many threads, and lots
of tasks (e.g. database I/O, etc.) that you might not fully understand.
Robbing the kernel of resources because you think you know better is quite
often a mistake.

The *real* problem is the overhead with context switching between userland
and the kernel, and copying data between user applications and the kernel.

Solutions to this problem can involve hacks like moving TCP into user
space, but often a better solution, that scales equally well — sometimes
much better — exists.  That solution is to move your application into the
kernel itself.  This comes at a hefty engineering cost, but you can still
benefit from the full benefits of decades of research and improvement in
the networking stack that exists in your kernel, as well as the great
concurrency support available in modern kernels.

I’ve got some real-world experience here; I can’t say I supported 10M
connections, but a few tens of thousands was done *trivially*, with very
little load on the box itself.

The other thing the article gets badly is the idea that somehow we need to
support 10M connections on a single box.  Unless the connections are mostly
idle, this is absurd.  The reason it’s absurd is a few things:

a) The bandwidth needs for such a box are going to be huge.  If 10M
connections are doing 10Kbps (*very modest traffic*) we’re talking about
100Gbps (without consideration of overheads).   Your twitter feed might be
less than 10Kbps, but most other things are *not*.

b) The compute associated with doing non-trivial work for 10M connections,
even if each is only doing 1 packet per second, is *high*.  10Mpps is
achievable with modern systems, if the workload is *tiny*.  Even the work
required to process ordinary TCP frames is going to be quite severe.  Now
understand that modern protocols all involve crypto — and despair.

c) “Modern scalable webservers” (*hah*) run single threaded.  This means
that they have a single thread that processes all I/O, and rely on
intelligent epoll to prevent them from having to do O(n) scans each time
they are woken.  Typically you’re going to have to break such a workload
up, because a single CPU core is simply going to be unable to churn
non-trivial amounts of data.  (At 10Kbps per conn you’re talking about 6
packets per second per connection, so now we’re at 60 Mpps!)   (Note to run
at this rate, you basically shut off interrupts altogether, and let the
server thread spin watching for status changes in the registers that track
NIC ring descriptor indices.)

d) Fault impacts.  This one is serious.  If you’ve got 10M connects and
your server fails, what happens?  You upset 10M TCP connections.  Now you
either have 10M unhappy customers, or you have to reestablish 10M
connections.  Any idea what the overhead of setting up 10M new TCP
connections is like?    Especially if we are also going to reestablish
crypto contexts?!?

e) TCP port numbers.  There are 64K potential port numbers.  Even if you
use all of them, you wind up having to some ugly things by including the
source address in your look ups.  Most TCP stacks don’t allow for TCP port
numbers to be shared in this way.

f) Manageability.   If you have 10M connections, imagine the output from
netstat.  Worse, imagine collecting details on a per-connection basis.  I
can attest to this problem personally — when I had thousands of connections
in kernel, I also had very high levels of statistics reported through the
in-kernel kstat scheme.  The problem was that the effort to copy those
statistics out from kernel to user space was … excessive.  In fact, this
was the only “slow” part of my solution there — moving the management data
back and forth was vile.

You also need to think what the article means by a “connection”.  Modulo
TCP port exhaustion and memory resources, the actual effort to maintain an
*idle* connection is negligible, especially if one uses an O(1) lookup
(e.g. a hash table) to map between remote address and connection state.
What is *hard* is balancing *traffic*.  If all 10M connections are
demanding service, the workload is going to be extreme.

This is why most network hardware specs give numbers in terms of packets
per second when talking about performance, and connection limits are only
used when discussing limitations in the amount of state keeping that the
device has (such as the sizes of NAT tables or somesuch).

- Garrett

On Mon, Dec 4, 2017 at 2:56 PM SGSeven Algebraf <a1rex2003@xxxxxxxxx> wrote:

I found it interesting:

http://highscalability.com/blog/2013/5/13/the-secret-to-10-million-concurrent-connections-the-kernel-i.html

References:
- [nanomsg] The Secret To 10 Million Concurrent Connections....
  - From: SGSeven Algebraf

[nanomsg] Re: The Secret To 10 Million Concurrent Connections....

Other related posts: