[nanomsg] Re: Introduction and questions

  • From: Martin Sustrik <sustrik@xxxxxxxxxx>
  • To: jimmy frasche <soapboxcicero@xxxxxxxxx>
  • Date: Tue, 25 Jun 2013 07:19:09 +0200

On 25/06/13 01:10, jimmy frasche wrote:

Let's say that for some reason I have some super fancy custom SP that
has some similarities to pub/sub on some level, just enough that I can
handle attaching a sub socket to the topology (but I have no way of
handling a pub socket, for whatever reason). The only thing the sub
socket needs to know is "I can work with you", and the only thing
my SP cares about is "are you one of my sockets or a socket I can work
with?" If you say "I'm pub/sub 2, specifically a sub", that information
is still there, but "I'm a sub 2" is the minimum information required.

Let's have a look at the whole SP thing from the philosophical point of view: What's a protocol, say REQ/REP? It's a specification of a distributed algorithm. The algorithm solves some specific problem. It constrains the user in what he can do, but on the other hand delivers certain behaviour and guarantees the user can rely on. All the nodes in the topology cooperate to deliver the desired behaviour.

Now, if you connect a socket to the topology which advertises itself as SUB but has different behaviour, you break the contract. The user cannot reason about the behaviour of the topology as a whole any more. He can't rely on the guarantees given by the PUB/SUB specification any more. Etc.

That's why the protocol field is separate. By specifying PUB/SUB in this field you are basically saying "this node is going to play by the rules of the PUB/SUB specification, implement the distributed algorithm specified therein and will cooperate with other PUB/SUB nodes to form a well-behaved topology."

In other words, if you want a protocol that's similar to PUB/SUB but differs slightly from it, define a new SP protocol with a different protocol ID. Internally, you can of course re-use the PUB/SUB implementation if you find that useful.
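
A minimal sketch of that idea in Python, not nanomsg code: the protocol ID values (2 and 200), the class names and the accepts() helper are all made up for illustration. The variant re-uses the PUB/SUB implementation by inheritance but advertises its own ID, so it never joins a genuine PUB/SUB topology.

```python
class PubSubSocket:
    # Hypothetical ID; the thread does not define the real numbering.
    PROTOCOL_ID = 2

    def __init__(self, role):
        self.role = role  # "pub" or "sub"

    def accepts(self, peer):
        """Accept a peer only if it promises to follow the same specification."""
        if peer.PROTOCOL_ID != self.PROTOCOL_ID:
            return False
        # Within one protocol, a pub may only be paired with a sub.
        return {self.role, peer.role} == {"pub", "sub"}


class FancySocket(PubSubSocket):
    # Same implementation re-used internally, but a different (hypothetical)
    # protocol ID, so it makes no claim about PUB/SUB guarantees.
    PROTOCOL_ID = 200


print(PubSubSocket("pub").accepts(PubSubSocket("sub")))  # same protocol
print(PubSubSocket("pub").accepts(FancySocket("sub")))   # different protocol
```

The second call is rejected even though the fancy socket behaves almost like a sub, which is the point: the ID stands for the contract, not for the implementation.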

4. Topology ID. So, for example, if you have two pub/sub topologies on your
network (e.g. stock quotes vs. stock trades) you want to assign them
different IDs so that a node from one topology cannot be accidentally
connected to the other topology. This property needs some more thinking
about though.

That seems like something for the PUB/SUB protocol to deal with, not
the nanomsg protocol itself. Any intertangling of the two means that
possibly the nanomsg code has to be aware of the pub/sub code and vice
versa and that means you can't write one without the other and
maintenance of one is more likely to affect the other. That seems like
a bad road to go down.

It could still be in the nanomsg header and separate in the
implementation if it's some blob of protocol-specific bytes that the
protocol gives to nanomsg to package, but then I don't see the
advantage of that over just letting the SP come up with its own header
for its own needs. Otherwise you either have a bunch of empty bits or
not enough to fit what you need.

First, it's a generic thing, not specific to PUB/SUB alone.

For example, if you are architecting a stock exchange you'll need the following topologies:

1. Posting orders (REQ/REP)
2. Stock quote distribution (PUB/SUB)
3. Trade distribution (PUB/SUB)
4. Management of individual components (REQ/REP)
etc.

The goal of the topology ID is to prevent, for example, a management client connecting to the order book.

An additional advantage is that by specifying the topology IDs you suddenly have the network traffic categorised based on *business* criteria. Thus, with adequate tools, the network admin can check, for example, how much bandwidth is consumed by the stock quote feed. Also, he can specify a bandwidth limit for the stock quotes so that they don't exhaust the bandwidth needed by other feeds.
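
Both uses of the topology ID can be sketched in a few lines of Python. This is only an illustration: the numeric IDs, accept_peer() and TrafficMeter are hypothetical names, and nothing here reflects an actual nanomsg API.

```python
from collections import Counter

# Hypothetical topology IDs for the stock exchange example.
ORDERS, QUOTES, TRADES, MANAGEMENT = 1, 2, 3, 4


def accept_peer(local_topology, peer_topology):
    """Refuse a connection from a node that belongs to another topology."""
    return local_topology == peer_topology


class TrafficMeter:
    """Counts bytes per topology ID, i.e. per business-level feed."""

    def __init__(self):
        self.bytes_by_topology = Counter()

    def record(self, topology_id, payload):
        self.bytes_by_topology[topology_id] += len(payload)


# A management client trying to reach the order book is turned away.
print(accept_peer(ORDERS, MANAGEMENT))

meter = TrafficMeter()
meter.record(QUOTES, b"ACME 10.5")
meter.record(TRADES, b"ACME 100@10.5")
print(meter.bytes_by_topology[QUOTES])
```

A bandwidth limit per feed would hang off the same counter, for example by comparing bytes_by_topology[QUOTES] against a configured ceiling.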

I personally prefer binary encoding (e.g. fixed 8 byte header) as it makes
it easier for hardware to deal with it, even in high-volume scenarios
(backbone routers etc.)

Also, when there are new connection-less transports added, the header will
be included into each packet. Thus, making it as short as possible to avoid
excessive bandwidth overhead seems like a good idea.

Of course, UDP header could be binary while TCP header is text-based,
however, it kind of feels cleaner to strive for similar header style for
different transports.
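
To make the "fixed 8 byte header" idea concrete, here is a rough Python sketch. The field layout (16-bit protocol ID, 8-bit version, 8-bit role, 32-bit topology ID) is purely an assumption for illustration; the thread does not fix any layout.

```python
import struct

# Assumed layout, big-endian (network byte order), no padding:
#   uint16 protocol ID | uint8 version | uint8 role | uint32 topology ID
HEADER = struct.Struct(">HBBI")


def pack_header(protocol_id, version, role, topology_id):
    """Encode the four fields into exactly HEADER.size (8) bytes."""
    return HEADER.pack(protocol_id, version, role, topology_id)


def unpack_header(data):
    """Decode the leading 8 bytes of a packet into a tuple of fields."""
    if len(data) < HEADER.size:
        raise ValueError("packet shorter than header")
    return HEADER.unpack(data[:HEADER.size])


packet = pack_header(2, 1, 0, 7) + b"payload"
print(HEADER.size)
print(unpack_header(packet))
```

Because every field sits at a fixed offset, a router can read the topology ID without parsing anything, which is the argument for binary over text made above.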

Those are good arguments for packing the encoding as tight as possible.
And the nanomsg format should be the same regardless of transport
(even if particular transports such as UDP require an extra transport
specific header before the nanomsg header)

I didn't consider connectionless transports. Perhaps the socket
type/version should go in the 'transport specific header' and/or a
transport specific handshake can determine a one-byte identifying
token to use in communication between that pair of sockets? Maybe that
last one is too complex and fiddly though.

More importantly, the line between the "transport-specific" and "transport-agnostic" parts is pretty blurry. And given that we are speaking about headers of only a few bytes here, I would just make the whole header transport-specific. That'll provide the most flexibility for the transports.

Unless the UDP thing really squashes it, I think the socket type's name
being ASCII, even if the rest is binary, is good, assuming it doesn't
have to be plastered on every message.

The problem with numbers is that the numbers have to be standardized,
and even if they're used sequentially initially, years on the (name,
version) to number table starts to get weird and troublesome to follow
as new protocols are added between new versions of old ones, and soon you
have sockets being compatible with 3, 27, 28, 104, and 5689, and that
map would have to be in the RFC. If two people come up with SPs on
their own and happen to choose the same ident someone's going to have
to switch their system over if either wants to open source their SP.
Likewise if I have a custom SP not worth open sourcing that uses ident
111 and a new nanomsg comes out that uses 111 for the new version of
sub sockets, I have to change it over even though my socket isn't
named sub.

Maybe the efficiency is worth having to keep a spreadsheet of (socket,
version) ->  ident map as part of the standard and having everyone else
work around that. I don't like it. I'd rather fritter a few extra
bytes on peace of mind, but I don't have to like it. Not my protocol:
but that's my two cents on the subject.

You'll have the same problem with textual names. The words that make sense as socket types are rather limited in number so you are going to get clashes.

In either case you need a central authority to keep the list of existing protocols/socket types. The obvious choice for that is IANA. (See, e.g. the list of TCP ports managed by IANA.) Till then we can just keep the table on the web page somewhere.
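
As a toy illustration of why the clash problem is the same for numbers and names, here is a small Python registry. ProtocolRegistry and its methods are hypothetical; they stand in for whatever table ends up on the web page or with IANA.

```python
class ProtocolRegistry:
    """Central table mapping numeric idents to (name, version) pairs."""

    def __init__(self):
        self._by_ident = {}
        self._by_name = {}

    def register(self, ident, name, version):
        key = (name, version)
        # A clash is possible on either side of the mapping.
        if ident in self._by_ident:
            raise ValueError(f"ident {ident} already taken by {self._by_ident[ident]}")
        if key in self._by_name:
            raise ValueError(f"{key} already registered as {self._by_name[key]}")
        self._by_ident[ident] = key
        self._by_name[key] = ident

    def lookup(self, ident):
        return self._by_ident[ident]


registry = ProtocolRegistry()
registry.register(1, "pubsub", 1)
registry.register(2, "reqrep", 1)
print(registry.lookup(2))
```

Whichever key goes on the wire, two independent authors can pick the same one, so the uniqueness check has to live in one shared place.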

Each network connection has its own goroutine that owns said
connection, operates its state machine, and does any transport
specific operations necessary.

It communicates with a controller (one per nanomsg socket) that
handles the queue and message (un)packing, per socket type, and
communicates with the nanosocket.

The nanosocket just sends commands to the controller and receives
replies and is in whatever goroutine the client is using it in.

Martin, does that sound like the correct architecture once you clear
away all the low-level stuff?

I think there's an "endpoint" object missing. So, when you do nn_bind ("tcp://127.0.0.1:5555"), an endpoint is created, which itself has a list of connections.
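
The ownership hierarchy could look roughly like this in Python. It only shows the object relationships; the class and method names are illustrative and not taken from the nanomsg sources or from the Go code under discussion.

```python
class Connection:
    """One established network connection to a single peer."""

    def __init__(self, peer):
        self.peer = peer


class Endpoint:
    """Created by one bind (or connect) call; owns its connections."""

    def __init__(self, address):
        self.address = address
        self.connections = []

    def add_connection(self, peer):
        conn = Connection(peer)
        self.connections.append(conn)
        return conn


class Socket:
    """The user-visible socket; owns any number of endpoints."""

    def __init__(self):
        self.endpoints = []

    def bind(self, address):
        endpoint = Endpoint(address)
        self.endpoints.append(endpoint)
        return endpoint


sock = Socket()
ep = sock.bind("tcp://127.0.0.1:5555")
ep.add_connection("10.0.0.1:40000")  # made-up peer addresses
ep.add_connection("10.0.0.2:40001")
print(len(sock.endpoints), len(ep.connections))
```

In the goroutine design described above, the per-connection goroutine would correspond to Connection and the controller to Socket, with Endpoint sitting between them.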

Martin
