[nanomsg] Re: Generalizing pubsub distribution

  • From: Carlos Pita <carlosjosepita@xxxxxxxxx>
  • To: nanomsg@xxxxxxxxxxxxx
  • Date: Sun, 13 Dec 2015 16:36:35 -0300

Hi Jack,

Thank you very much for the link. I think the master/mirror SP could be
useful for my up-to-date requirement, so I will give it a try. But sadly
that part is already covered pretty well by a lossy pubsub with short
enough queues, while the lossless requirement is not covered at all by
pubsub.
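For anyone following along, my reading of the sync SP's semantics (a mirror can always request only the most recent message published by the master) amounts to a single overwritten slot rather than a queue. A minimal Python sketch of that behavior, where the names are mine and not from the branch:

```python
import threading

class Master:
    """Sketch of the sync SP's semantics as I read them: the master keeps
    only its most recent message, and each mirror request is answered with
    that freshest value (older publications are simply superseded)."""

    def __init__(self):
        self._latest = None
        self._lock = threading.Lock()

    def publish(self, msg):
        with self._lock:
            self._latest = msg   # overwrite: no queue, no backlog

    def mirror_request(self):
        with self._lock:
            return self._latest  # always the most recent publication
```

That is exactly what I want for freshness, but it says nothing about losslessness, which is my other requirement.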

My use case is pretty common in machine learning. More or less, what I
have in the online setting is:

Stream 1--->1 Sampler 1--->N Modeler 1--->N Estimator

The sampler publishes domain observations into different "shards", the
modelers then publish vectorized observations into different model
topics, and the estimators consume those vectors to learn. The
connections between the components are established by topic.

In the online setting I can't really have permanent backpressure
without infinite queues, as there is no way to stop the world (the
world here being the market event stream). I find the standard pubsub
lossy behavior convenient because it allows some flows to go faster
than others (which will eventually drop) and because the dropping is
pretty random too (at least random enough to cover my needs). Also, I
drop observations older than 5 minutes, to keep up with the times.
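To make the lossy behavior I rely on concrete, here is a toy Python sketch (not nanomsg code; the class and its parameters are made up for illustration) of a short bounded queue that drops the oldest observation on overflow and discards anything older than the 5-minute TTL on read:

```python
import collections
import time

class LossyTopicQueue:
    """Illustrative only: a short bounded queue that drops the oldest
    message on overflow, plus a TTL filter on read. Real nanomsg pubsub
    drops at the socket level, but the effect on slow flows is similar."""

    def __init__(self, maxlen=8, ttl=300.0):
        self.buf = collections.deque(maxlen=maxlen)  # old items fall off
        self.ttl = ttl  # e.g. 5 minutes = 300 s

    def publish(self, obs, now=None):
        self.buf.append((now if now is not None else time.time(), obs))

    def consume(self, now=None):
        now = now if now is not None else time.time()
        while self.buf:
            ts, obs = self.buf.popleft()
            if now - ts <= self.ttl:
                return obs  # fresh enough
        return None  # everything was stale, or the queue was empty
```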

But then there is the offline setting: you have to train lots of
estimators with a big dump of recorded observations (say, two months of
market events) for model selection, or just so that an estimator
selected for the online setting can catch up before going online. The
circuit becomes:

Replayer 1--->N Modeler 1--->N Estimator

But the replayer is just too fast for the other components, which do
the heavy lifting. Everything else is the same, but the fact that the
replayer goes berserk is fatal for the system. Now I need some kind
of backpressure, or to manually limit the input frequency (for
example, by using a leaky bucket). I dislike the manual solution very
much because I learned the hard way how unstable this frequency can be
(it depends on the number of cores, the number of modelers and
estimators, the number of features in each model, the number of
hyperparameters to explore, the number of open tabs in my Firefox...
you get the idea). I want the system to regulate itself and go as fast
as it can without dropping, say, more than 1% of observations. In
theory, maximizing my utility function would imply some tradeoff
between losses and speed, but in practice I'm OK with blocking until
some hardcoded sndtimeo is hit (the sndtimeo could also be infinite).
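What I'm after is the difference between the two options sketched below: a blocking send with a sndtimeo-style timeout that self-regulates, versus a hand-tuned leaky bucket. This is illustrative stdlib Python, not nanomsg; the helper names are invented:

```python
import queue
import time

# A modeler's bounded inbox; the sender blocks when it is full, up to a
# timeout analogous in spirit to nanomsg's NN_SNDTIMEO socket option.
inbox = queue.Queue(maxsize=4)

def send_with_timeout(msg, sndtimeo):
    """Block until there is room or sndtimeo seconds have passed.
    True = delivered; False = timed out (the caller may then drop,
    so losses stay bounded instead of rate-dependent)."""
    try:
        inbox.put(msg, timeout=sndtimeo)
        return True
    except queue.Full:
        return False

class LeakyBucket:
    """The manual alternative: pace sends at a fixed rate that has to
    be hand-tuned to the consumers' (unstable) throughput."""

    def __init__(self, rate_hz):
        self.interval = 1.0 / rate_hz
        self.next_t = time.monotonic()

    def wait(self):
        # Sleep until the next send slot, then schedule the one after.
        now = time.monotonic()
        if now < self.next_t:
            time.sleep(self.next_t - now)
        self.next_t = max(self.next_t + self.interval, now)
```

The first variant adapts automatically when the consumers slow down; the second needs retuning every time the environment changes, which is exactly my complaint.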

I know there are many ways to implement something like that using
other, more primitive SPs as building blocks; I have been thinking
about this stuff for dangerously many days now. What I find really
uncomfortable is the fact that pubsub is almost there, except for a
little diff. True, pubsub is mainly for fast broadcasting of messages
to many potentially slow and unreliable subscribers, so dropping makes
a lot of sense. But pubsub is also for routing based on topics, and
this is a different, non-trivial concern that IMHO ended up too tightly
coupled with the "fast distribution to shitty subscribers" concern.

Take for example AMQP 0.9.1, which Martin knows much better than I do
;). Let's consider RabbitMQ. The exchange part is concerned with
routing, while queue caps and flow control are concerned with the
delivery modality: drop on max size or TTL, block on (global) HWM. They
are more or less orthogonal aspects that you can combine as you fancy.
I understand that the SPs idea is philosophically different, as SPs
attempt to capture real usage patterns instead of providing an
abstract, more taxonomically guided, orthogonal generalization; the
latter would leave the user with the task of reimplementing common
patterns (reqrep, pubsub, etc.) himself each time. It's just that I
think in this particular case the topic routing stuff is TOO coupled
with the delivery mode, and there is no simple way to reuse the topic
routing outside the lossy pubsub SP.
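Concretely, here is what I mean by decoupling the two concerns, as a hypothetical Python sketch (prefix matching in the spirit of nanomsg SUB sockets; the policy functions are made up): the router does topic matching, and whether delivery drops or signals backpressure is a pluggable policy:

```python
import collections

def lossy(q, msg, maxlen):
    """Pubsub-style delivery: drop the oldest message when full."""
    if len(q) >= maxlen:
        q.popleft()
    q.append(msg)
    return True

def lossless(q, msg, maxlen):
    """Refuse when full, so the sender can block or retry."""
    if len(q) >= maxlen:
        return False             # signal backpressure instead of dropping
    q.append(msg)
    return True

class Router:
    """Topic routing by prefix, orthogonal to the delivery policy."""

    def __init__(self, policy, maxlen=8):
        self.subs = {}           # topic prefix -> subscriber queue
        self.policy, self.maxlen = policy, maxlen

    def subscribe(self, prefix):
        q = collections.deque()
        self.subs[prefix] = q
        return q

    def publish(self, msg):
        # Deliver to every subscriber whose prefix matches; report False
        # if any lossless delivery had to refuse.
        ok = True
        for prefix, q in self.subs.items():
            if msg.startswith(prefix):
                ok = self.policy(q, msg, self.maxlen) and ok
        return ok
```

With this kind of split, the same routing code serves both my online (lossy) and offline (lossless) circuits; only the policy changes.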

Hope I'm being clear, I'm no Oscar Wilde.

Cheers
--
Carlos





On Sun, Dec 13, 2015 at 3:46 PM, Jack Dunaway <jack@xxxxxxxxxxxxxxxx> wrote:

As it's more important for me to keep the estimators up to date than to
process every observation

There exists a branch on github which defines the "sync" scalability
protocol (SP):
https://github.com/nanomsg/nanomsg/commit/79c9783c02ef16a2ef3fef5b26d05053530d15db

In this protocol, the two endpoints "MASTER" and "MIRROR" form a 1:N
relationship, where a "MIRROR" is always able to request only the most
recent message "published" by the "MASTER".

Is this closer to what you're looking for? (I am interested to see this SP
mainlined, also having application domains where freshness is more important
than losslessness)

Kind regards,
Jack R. Dunaway | Wirebird Labs LLC


On Sun, Dec 13, 2015 at 12:32 PM, Carlos Pita <carlosjosepita@xxxxxxxxx>
wrote:

dist. But I think the protocol api could be all that's needed, as I
suggested in my last post. Is that right?

Now I notice protocol.h is not even installed by autotools, which makes
me think I'm barking up the wrong tree, since it doesn't look like an
interface for the nanomsg user but for the nanomsg developer.


