[nanomsg] Re: Trying to implement "directory" pattern

  • From: Paul Colomiets <paul@xxxxxxxxxxxxxx>
  • To: Martin Sustrik <sustrik@xxxxxxxxxx>
  • Date: Sun, 24 Feb 2013 13:57:12 +0200

Hi Martin,

On Sat, Feb 23, 2013 at 8:19 AM, Martin Sustrik <sustrik@xxxxxxxxxx> wrote:

> Hi Paul,
>
>
>  I'm trying to implement the following message pattern, I call it
>> "directory" (have a better name?).
>>
>> Directory pattern has two socket types: ASK and ANSWER. They are mostly
>> analogous to REQ and REP types respectively. The difference is that
>> ANSWER socket can choose what requests it serves. This works by setting
>> SUBSCRIBE option on the socket, which sends the subscription to ASK
>> socket, similarly to how it's done for PUB/SUB sockets in crossroads IO
>> (with subscription forwarding). ASK socket keeps a trie of prefixes and
>> for each node of the trie a list of pipes (ANSWER endpoints) is kept to
>> load-balance between them. So each message matches against trie to find
>> a subset of pipes that can be used to serve a request.
>>
>
> This looks like merging multiple REQ/REP topologies into a single one.
> What's the ultimate goal here? To limit the number of open TCP ports?


The goal is to make it easy to build a topology with *stateful* nodes. The
complex part of building it with a bunch of topologies is that every user
must be aware of how it works.

For example, let's imagine I've developed a memcached-like cache service.
Now, to use it in a real project, I must do one of the following:

1. Expose the whole configuration and connection management to the user (i.e.
document it thoroughly, implement it in various languages, etc.)

2. Hide everything behind a broker, which increases infrastructure complexity
and actual latency (the latter may be crucial for a memcached-like service)

Note that the real memcached chose the first approach, and everybody invents
their own way to update the memcached configuration on every node.

With the pattern implemented, I just declare that you can access the
service via an ASK socket, and it works like everything else in nanomsg (you
can use either bind or connect, use devices, etc.)



>  2. When a pipe is added to an answer socket, all subscriptions should be
>> resent. Is there a way to put all the subscriptions into a pipe without
>> any limits? Otherwise, I need to either build a complex state machine to
>> track which subscriptions are already resent, or bundle all
>> subscriptions into a single message (inventing some ad-hoc message
>> format). BTW, there's a similar problem with just adding a subscription when
>> some output pipes are full. What does crossroads do in this case?
>>
>
> I think you are going to run into problems here.
>
> Sending same data to multiple destinations reliably has the effect of
> slow/dead/malevolent consumer being able to block the whole topology.
> Subscriptions, being just the data in the end, experience the same problem.
>
>
I think you are too idealistic here. We know for sure that the subscriptions
fit in memory, so we can keep a memory buffer to send them from. Another
question: as far as I understand, there is no API for buffering in nanomsg,
am I right? (more discussion below)


> What it means is that one hung ANSWER application could possibly block
> the whole datacenter.
>
> You'll have to think out of the box here. For example, pub/sub solves the
> problem by allowing just one upstream pub socket per sub socket.


Sorry, but I don't understand how that solves the problem. Does
setsockopt(...NN_SUBSCRIBE...) block until the subscription is sent? Then it's
not documented and counter-intuitive. If setsockopt doesn't block, then
what stops you from filling all the buffers by subscribing a few thousand
times in a tight loop?


> Other tricks are: Allowing for unreliability. Not sending at all, doing
> filtering locally. Etc.
>
>

Unreliability is not a nice option, since it may lead to unavailable parts
of the topology.

Filtering locally is not an option. Look at the diagram in the previous
post. If I sent a request to all the nodes, requests with "computers"
would be processed by two nodes, B and C, simultaneously. It's not scalable.

In real life, 99.9% of the time the output buffers in ANSWER sockets will be
empty, and subscription traffic will be low. But we still can't drop the
connection when the buffer is full, because there are a few legitimate
scenarios where it's crucial not to drop the connection:

1. On first connect, you have probably set many subscriptions before connecting
2. On reconnect, all subscriptions must be resent, even if there are
thousands of them. It may happen once a year, but it still must work
3. On cluster reorganization, hundreds of subscriptions may be updated,
especially on intermediate devices
4. Just after sending a big reply that doesn't fit the network buffer, you may
update some subscription; even the smallest one may be blocked by the reply.

So, in the end, what stops us from implementing buffering, at least for
internal messages?


>  3. I'm going to use NN_RESEND_IVL and NN_SUBSCRIBE options, but they
>> have the same integer value. So I either need to reinvent names, or we need
>> to set unique numbers for built-in socket options. I think the latter is
>> preferable, as it is also less error-prone (you can't setsockopt(NN_RESEND_IVL)
>> for a SUB socket)
>>
>
> Ah, there are socket option levels now, same as in POSIX.
>
>
Yeah, I know.


> So NN_RESEND_IVL is specific to REQ/REP pattern (NN_REQREP option level)
> and NN_SUBSCRIBE is specific to pub/sub (NN_PUBSUB option level). You
> should define an unique option level for your protocol and define the
> option constants as you see fit.


Well, the name doesn't contain the protocol type. That seems strange. So
should I use NN_DIR_RESEND_IVL and NN_DIR_SUBSCRIBE, or NN_RETRY_IVL and
NN_OFFER? Yeah, it's a policy issue, not a technical one.


> Separate option level will guarantee that you won't clash with other
> protocols.
>
>
If I call setsockopt(.., NN_SUB, NN_RESEND_IVL, ...) by mistake, bad things
can happen :)


>
>  4. Martin, could you describe a bit, how chunks work? I need to mark
>> messages for being either subscriptions or request/reply, as well as
>> keep a chain of routing data like for req-rep. Any suggestions?
>>
>
> OK. Basically, message is composed of header and body. Each is represented
> as "chunk". "chunk" is just a binary buffer.
>
> From the transport you'll get a message with just the body and no header.
> What you have to do is to separate the message into header and body in raw
> socket's recv() function.
>
> So, you initialise the header to the appropriate size, copy the header data
> from the body chunk to the header chunk and trim the header from the body
> chunk (trim -- as opposed to copy -- allows to retain zero-copy guarantees).
>
> Note that inproc transport passes your messages in parsed state (header
> and body being separate) so you don't have to do the above for inproc.


Is the message split done by each protocol on its own? Is there a
length-prefixed or some other generic format for message headers that I can
reuse?


-- 
Paul
