[nanomsg] Re: Trying to implement "directory" pattern

  • From: Paul Colomiets <paul@xxxxxxxxxxxxxx>
  • To: Martin Sustrik <sustrik@xxxxxxxxxx>
  • Date: Mon, 25 Feb 2013 10:14:11 +0200

Hi Martin,


On Mon, Feb 25, 2013 at 7:51 AM, Martin Sustrik <sustrik@xxxxxxxxxx> wrote:

> On 24/02/13 12:57, Paul Colomiets wrote:
>
> >     Do I understand it correctly: Is the goal to have multiple memcached
> >     instances, each with its own name, and a management console able to
> >     send a request to any of those instances based on a name?
> >
> >
> > I'm not sure I understand the question correctly, but a couple of
> > remarks follow.
> >
> > The goal is not to interconnect memcached instances (they should
> > probably be connected with the BUS pattern).
> >
> > I don't think there is a management console in memcached :). But in my
> > case, it's a single "cluster" of memcached instances, each of which
> > should know the others, to keep the set of subscriptions distinct.
> >
> > The protocol can be used to request data by instance name, but it's more
> > than that. For example, to scale memcached smoothly, you need to have
> > more "buckets" than instances. For 10 nodes we allocate, say, 1000
> > buckets, and give each node 100 of them. When adding another node, we
> > move about 9 buckets from each node to the new one. This keeps the load
> > distributed evenly across the 11 nodes.
> >
> > I have more use cases than a cache service; the memcached example is
> > just the one that is easiest to explain. But if anybody knows a better
> > way to implement that in nanomsg, I'm happy to listen :)
>
> Hm. I think I'm lost already :) What's a bucket?
>
>
Ah, OK, I seem to have skipped too much when describing it.

A bucket is a virtual piece of the dataset. To find the memcached instance
for a given key, we calculate a hash, then do a modulo operation:

bucket_index = hash(key) % number_buckets

Then we have a table mapping bucket_index to memcached instance. For
example, buckets 1..100 are stored at node A and buckets 101..200 at node
B. If we had 2 nodes and 200 buckets and wanted to add a third one, we
would leave buckets 1..66 at A and 101..166 at B, and assign buckets
67..100 and 167..200 to node C.

In my case the bucket mapping is represented using subscriptions.
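
Just to make that concrete, here is a minimal sketch of the lookup in C.
The hash function, the table layout and all the names are mine, picked
only for illustration; this is not nanomsg code:

    #include <stdint.h>
    #include <stdio.h>

    #define NUM_BUCKETS 200

    /*  Bucket-to-node table: in the 2-node example above, buckets
        0..99 live on node A (0) and buckets 100..199 on node B (1).
        Adding node C means rewriting about a third of the entries.  */
    static int bucket_to_node [NUM_BUCKETS];

    /*  Any decent hash works; FNV-1a is used purely as an example.  */
    static uint32_t hash_key (const char *key)
    {
        uint32_t h = 2166136261u;
        while (*key)
            h = (h ^ (uint32_t) (unsigned char) *key++) * 16777619u;
        return h;
    }

    /*  bucket_index = hash(key) % number_buckets, then table lookup.  */
    static int node_for_key (const char *key)
    {
        return bucket_to_node [hash_key (key) % NUM_BUCKETS];
    }

    int main ()
    {
        int i;
        for (i = 0; i != NUM_BUCKETS; ++i)
            bucket_to_node [i] = i < 100 ? 0 : 1;
        printf ("key 'foo' -> node %d\n", node_for_key ("foo"));
        return 0;
    }

The point is that only the table changes when nodes are added; the hash
and the modulo stay fixed, so clients never re-hash the whole keyspace.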


> Anyway, that should not prevent you from moving on with the
> implementation. See answers to your questions below.
>
>
Sure :) Thanks for the answers


>
>>         2. When a pipe is added to an answer socket, all subscriptions
>>         should be resent. Is there a way to put all the subscriptions
>>         into a pipe without any limits? Otherwise, I need to either
>>         build a complex state machine to track which subscriptions have
>>         already been resent, or bundle all subscriptions into a single
>>         message (inventing some ad-hoc message format). BTW, there is a
>>         similar problem with just adding a subscription when some
>>         output pipes are full. What does Crossroads do in this case?
>>
>>
>>     I think you are going to run into problems here.
>>
>>     Sending the same data to multiple destinations reliably has the
>>     effect of a slow/dead/malevolent consumer being able to block the
>>     whole topology. Subscriptions, being just data in the end,
>>     experience the same problem.
>>
>>
>> I think you are too idealistic here. We know for sure that subscriptions
>> fit in memory, so we can keep a memory buffer to send them.
>>
>
> Sure. What I am saying is that when sending messages to two destinations
> in parallel and reliably, if one of the destination applications hangs, it
> will ultimately cause the other application to hang. Thus the failure
> propagates sideways.
>
>
Eh, why? If subscriptions are buffered, the only imperfection is that
subscriptions use twice the memory needed (one instance is the tree
itself, and the other is the buffer); everything else works perfectly. I
think in 99.99% of real cases the subscriptions' memory usage is at least
two orders of magnitude lower than the memory usage of the application. At
least in pub-sub.

For the "directory" pattern the problem is slightly more complex, as the
system may accept requests but be unable to send replies, while at the
same time being unable to unsubscribe. But IMO that's not a kind of
problem that nanomsg can solve in any protocol.


>
>> The other question is that, as far as I understand, there is no API for
>> buffering in nanomsg, am I right? (more discussion below)
>>
>
> There's buffering code in the inproc transport (inproc doesn't use a
> network protocol with tx/rx buffers, so it has to buffer messages
> itself). See src/transports/inproc/msgqueue.h
>
> I should probably move that class into src/utils, so that anyone can
> re-use it.
>
>
Yeah, that would be nice. So I need to create a pipe for subscriptions and
use a worker thread to pick up messages one by one and put them into the
output pipe?
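
In case it helps the discussion, here is a minimal sketch of the kind of
queue I mean: a FIFO of owned message buffers with a byte-count cap,
filled on subscribe and drained one message at a time by the worker. All
names are mine; the real class in src/transports/inproc/msgqueue.h has its
own interface:

    #include <stdlib.h>
    #include <string.h>

    /*  One queued message; owns its payload (C99 flexible array).  */
    struct msg_item {
        struct msg_item *next;
        size_t len;
        char data [];
    };

    /*  FIFO with a cap on the total number of buffered bytes.  */
    struct msg_queue {
        struct msg_item *head;
        struct msg_item *tail;
        size_t mem_used;
        size_t mem_cap;
    };

    static void queue_init (struct msg_queue *q, size_t cap)
    {
        q->head = q->tail = NULL;
        q->mem_used = 0;
        q->mem_cap = cap;
    }

    /*  Returns 0 on success, -1 when the cap is hit (caller retries
        once the worker has drained something).  */
    static int queue_push (struct msg_queue *q, const void *buf, size_t len)
    {
        struct msg_item *it;
        if (q->mem_used + len > q->mem_cap)
            return -1;
        it = malloc (sizeof (*it) + len);
        if (!it)
            return -1;
        it->next = NULL;
        it->len = len;
        memcpy (it->data, buf, len);
        if (q->tail)
            q->tail->next = it;
        else
            q->head = it;
        q->tail = it;
        q->mem_used += len;
        return 0;
    }

    /*  Worker thread: pop one message, write it to the output pipe,
        then free it. Returns NULL when the queue is empty.  */
    static struct msg_item *queue_pop (struct msg_queue *q)
    {
        struct msg_item *it = q->head;
        if (!it)
            return NULL;
        q->head = it->next;
        if (!q->head)
            q->tail = NULL;
        q->mem_used -= it->len;
        return it;
    }

Locking is omitted; in a real worker-thread setup the push and pop would
of course be guarded by a mutex and a condition variable.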


>
>>     What it means is that one hung ANSWER application could possibly
>>     block the whole datacenter.
>>
>>     You'll have to think out of the box here. For example, pub/sub
>>     solves the problem by allowing just one upstream pub socket per sub
>>     socket.
>>
>>
>> Sorry, but I don't understand how it solves the problem.
>>
>
> It prevents propagating the failure sideways. Thus, every failure is
> always local to a sub-tree.
>
>
>> Does setsockopt(...NN_SUBSCRIBE...) block until the subscription is
>> sent? Then it's not documented and counter-intuitive. If setsockopt
>> doesn't block, then what stops you from filling all the buffers by
>> subscribing a few thousand times in a tight loop?
>>
>
> It doesn't block yet, because the subscription forwarding is not yet
> implemented, but yes, I would say it should block.
>
>
Well, I believe that all blocking calls should have non-blocking
counterparts, and having a non-blocking counterpart for this one seems to
complicate the API.


>>     So NN_RESEND_IVL is specific to the REQ/REP pattern (the NN_REQREP
>>     option level) and NN_SUBSCRIBE is specific to pub/sub (the NN_PUBSUB
>>     option level). You should define a unique option level for your
>>     protocol and define the option constants as you see fit.
>>
>>
>> Well, the names don't contain the protocol type, which seems strange. So
>> should I use NN_DIR_RESEND_IVL and NN_DIR_SUBSCRIBE, or NN_RETRY_IVL and
>> NN_OFFER? Yeah, it's a policy issue, not a technical one.
>>
>>     A separate option level will guarantee that you won't clash with
>>     other protocols.
>>
>>
>> If I call setsockopt(.., NN_SUB, NN_RESEND_IVL, ...) by mistake, bad
>> things can happen :)
>>
>
> Yes. For consistency's sake, the option names should be prefixed by the
> socket type name. However, to keep names simple and consistent with 0MQ I
> opted to omit them here, i.e. NN_SUBSCRIBE instead of NN_SUB_SUBSCRIBE.
>
> Should we change that?


NN_SUB_SUBSCRIBE could be shortened, but trying to find synonyms for
NN_RESEND_IVL feels ugly. The prefix also helps to catch mistakes in the
code, so I'm +1 for it.
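
For illustration, this is how the two naming schemes compare at the call
site. nn_socket/nn_setsockopt/nn_close are the existing API; the prefixed
NN_SUB_SUBSCRIBE is the name we are discussing, not something that exists
today:

    #include <nanomsg/nn.h>
    #include <nanomsg/pubsub.h>

    int main ()
    {
        int s = nn_socket (AF_SP, NN_SUB);

        /*  Current, unprefixed name: nothing in the code ties the
            option to the NN_SUB level, so a mismatched level/option
            pair compiles silently.  */
        nn_setsockopt (s, NN_SUB, NN_SUBSCRIBE, "foo", 3);

        /*  Prefixed name under discussion (does not exist yet): a
            mistake like nn_setsockopt (s, NN_SUB, NN_REQ_RESEND_IVL,
            ...) would stand out immediately when reading the code.  */
        /*  nn_setsockopt (s, NN_SUB, NN_SUB_SUBSCRIBE, "foo", 3);  */

        nn_close (s);
        return 0;
    }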


-- 
Paul
