[nanomsg] Re: Trying to implement "directory" pattern

  • From: Martin Sustrik <sustrik@xxxxxxxxxx>
  • To: Paul Colomiets <paul@xxxxxxxxxxxxxx>
  • Date: Thu, 28 Feb 2013 08:17:49 +0100

On 26/02/13 20:41, Paul Colomiets wrote:

    The problem with that is auto-reconnect. Subscriber sends a lot of
    subscriptions, more than the producer is able to accept. It is not
    processing them, so TCP pushback happens. Subscriber sees that the
    subscription stream is stuck and disconnects the peer. The producer
    tries to reconnect and immediately gets hit by a subscription storm.
    And so on ad infinitum.


That's why I've written "if connection makes no progress". I mean
disconnect if no bytes were sent for 10 seconds, or something like
that. It means that any number of subscriptions can be sent, even if it
would take minutes to upload them. I think in all realistic situations
(up to thousands of subscriptions in up to a few seconds) it will work.
There is an edge case, when you create new subscriptions in a tight loop
and the publisher can't keep up with it. But I don't think it's a
situation that needs to be taken care of.
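
A minimal sketch of the "no progress" rule described above, in Python. The class name, method names and the explicit `now` argument are assumptions made for illustration; nothing here is nanomsg API. The point it shows is that the timer is reset by any sent bytes, so a slow upload survives while a fully stalled one does not.

```python
class ProgressWatchdog:
    """Tracks when bytes were last sent on a connection (hypothetical helper)."""

    def __init__(self, timeout_s=10.0, now=0.0):
        self.timeout_s = timeout_s
        self.last_progress = now

    def on_bytes_sent(self, nbytes, now):
        # Any forward progress, however small, resets the timer.
        if nbytes > 0:
            self.last_progress = now

    def should_disconnect(self, now):
        # True only if nothing at all was sent for timeout_s seconds.
        return now - self.last_progress >= self.timeout_s
```

With this rule an upload that trickles along for minutes is never cut off, because each small write pushes the deadline forward.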

It's a single problem IMO. It can be formulated like this:

"Given a limited tx buffer (whether in kernel or in user space), what should be done when it gets full and the user still wants to send a new subscription?"

Btw, speaking of realistic situations, I've just spoken to guys who are handling 130,000,000 subscriptions in ZeroMQ :)

Anyway, the problem can be split into 2 parts:

1. How to manage pushback.
2. What to do when it can't be managed any more.

The options for the first are either relying on TCP (problem occurs when tx buffer limit is hit) or building a rate limiting algorithm on top (problem happens when the rate limit is exceeded) -- the latter being basically what you are proposing.

I would say that both are functionally equivalent (i.e. the problem occurs when too much data is sent in too short a time), the only difference being that implementing rate limiting requires more work to be done.
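
To make the rate-limiting option concrete, here is a minimal token-bucket sketch in Python. The token bucket is just one common way to build a rate limit and is an assumption here, as are all names and parameters; the thread does not specify an algorithm. It shows the same shape of failure as a full tx buffer: once the burst allowance is used up, the next subscription cannot be sent right away.

```python
class TokenBucket:
    """Allows `rate` subscriptions per second with bursts up to `burst`."""

    def __init__(self, rate, burst, now=0.0):
        self.rate = rate
        self.burst = burst
        self.tokens = float(burst)
        self.last = now

    def try_send(self, now):
        # Refill according to elapsed time, capped at the burst size.
        elapsed = now - self.last
        self.last = now
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate)
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        # Rate limit exceeded: the caller now has to drop or wait.
        return False
```

A `False` return is the point where "what to do when it can't be managed any more" kicks in, exactly as it would with a full TCP send buffer.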

The interesting part is what happens when the problem occurs (tx buffer full, rate limit exceeded). The options here are:

1. Drop => results in inconsistent message delivery
2. Pushback => a hung publisher can stop the whole topology
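
The two options above can be sketched as policies on a bounded send queue. This is an illustrative Python model only; `TxBuffer`, its policy names and its return values are hypothetical and do not correspond to any nanomsg or ZeroMQ interface.

```python
from collections import deque


class TxBuffer:
    """Bounded queue of outgoing subscriptions with a full-buffer policy."""

    def __init__(self, capacity, policy):
        if policy not in ("drop", "pushback"):
            raise ValueError("policy must be 'drop' or 'pushback'")
        self.capacity = capacity
        self.policy = policy
        self.queue = deque()
        self.dropped = 0

    def submit(self, subscription):
        if len(self.queue) < self.capacity:
            self.queue.append(subscription)
            return "queued"
        if self.policy == "drop":
            # The peer never learns about this subscription, so
            # message delivery becomes inconsistent.
            self.dropped += 1
            return "dropped"
        # Pushback: the caller must wait until the peer drains the
        # queue. If the peer hangs, the caller waits forever.
        return "blocked"

    def peer_reads(self, n):
        # Models the publisher consuming up to n queued subscriptions.
        for _ in range(min(n, len(self.queue))):
            self.queue.popleft()
```

With `policy="drop"` the sender always makes progress but loses subscriptions; with `policy="pushback"` nothing is lost, yet `submit` keeps returning `"blocked"` for as long as `peer_reads` is never called.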

There's also the "reconnect" option which is just an evil variation on pushback. Instead of waiting to send the remaining few bytes, it disconnects, reconnects and tries to send the whole subscription set anew.
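
A toy model of why reconnecting is worse than waiting, in Python. It assumes, purely for illustration, that the peer accepts a fixed `chunk` of bytes per attempt before stalling; the function name and parameters are hypothetical.

```python
def upload(total, chunk, reconnect, max_attempts=100):
    """Return (attempts_needed, bytes_put_on_wire).

    attempts_needed is None if the upload never completes
    within max_attempts.
    """
    on_wire = 0
    offset = 0
    for attempt in range(1, max_attempts + 1):
        n = min(chunk, total - offset)
        on_wire += n
        offset += n
        if offset == total:
            return attempt, on_wire
        if reconnect:
            # Disconnecting throws away the progress made so far;
            # the whole subscription set is resent from the start.
            offset = 0
    return None, on_wire


# Waiting resumes where it stopped; reconnecting never finishes
# when the subscription set is larger than one chunk.
print(upload(1000, 300, reconnect=False))  # (4, 1000)
print(upload(1000, 300, reconnect=True))   # (None, 30000)
```

Under these assumptions waiting sends every byte exactly once, while reconnecting keeps resending the same prefix and never gets past it.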

There seems to be no way out.

If you see any other solution to the problem, please let me know.

Martin

