[nanomsg] loop detection

  • From: Garrett D'Amore <garrett@xxxxxxxxxx>
  • To: nanomsg@xxxxxxxxxxxxx
  • Date: Wed, 18 Feb 2015 20:13:26 -0800

I’m working on updating the SURVEYOR protocol stuff, and I’ve realized we 
have a pretty nasty problem in SP.

The SP protocols are all potentially subject to routing loops.  There is no TTL 
in them, *and* there can be loops.  With some of the topologies, it’s possible 
to build tragically bad routing loops that are explosive in nature (think 
PUB/SUB, or worse, some of the BUS fabrics.)  The very worst of these is my 
STAR protocol, where exponential packet storms can arise.

I think there are some fairly simple fixes here, but they will require some 
things that some may object to:

        1. Wire protocol changes.  I don’t see how to fix these problems 
without fixing the broken wire protocol - *unless* we want to have intermediate 
nodes keep some kind of cache of every packet they’ve seen in the last time t.  
(5 seconds?)  That seems tragically bad.
        2. Expanded header sets - more in detail below.
        3. A little more time / logic in processing packets.

I can see several different approaches to this problem.  Here they are:

APPROACH I.

        a. Assume every node has a unique ID of some sort.  I’m going to 
suggest 64-bit EUIs for now (e.g. MAC addresses).  In the worst case a random 
value can be used — the risk of collision is sufficiently small that I’m not 
concerned about it.
        b. Each hop along the path (device) adds a pipe ID, *and* its own 
64-bit ID, to the backtrace.  That means that each intermediate node adds 96, 
or more likely 128, bits.  (I’d choose 128 over 96.)  Frankly, if the pipe ID 
space weren’t so constrained, we could use 64 bits by limiting pipe IDs to 16 
bits and using 48-bit OUIs.  Except we also need a bit to identify the end of 
trace.
        c. When a node (device) sees its own ID in the backtrace, it discards 
the message; a sketch of this check follows below.  (Logging the presence of a 
routing loop might be good, or at least bumping a stat.)

        pros: perfect filtering, entirely stateless
        cons: substantial additional header sizes, and increased processing 
overhead across devices
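
Something like the following is what the check in (c) might look like on a 
device.  This is purely illustrative: none of these names exist in nanomsg, 
and the 128-bit entry layout (pipe ID, flags, 64-bit EUI) is just one way to 
lay it out.

    #include <stdint.h>
    #include <stdbool.h>
    #include <stddef.h>

    /* One backtrace entry as a device would append it (hypothetical layout). */
    struct sp_trace_entry {
        uint32_t pipe_id;   /* pipe the message arrived on at this hop */
        uint32_t flags;     /* room for an end-of-trace bit, etc. */
        uint64_t node_id;   /* 64-bit EUI (or random ID) of the device */
    };

    /* Returns true if our own node ID already appears in the backtrace,
     * i.e. the message has looped back to us and should be dropped. */
    static bool sp_loop_detected(const struct sp_trace_entry *trace,
                                 size_t nentries, uint64_t my_node_id)
    {
        for (size_t i = 0; i < nentries; i++) {
            if (trace[i].node_id == my_node_id)
                return true;   /* routing loop: discard, bump a stat */
        }
        return false;
    }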


APPROACH II.

        a. Instead of *every* node attaching its own ID, the *originating* 
node could attach an originator ID next to the request ID, and the request ID 
must be monotonically increasing.
        b. Every node records the highest request ID it has seen for each 
originator.  If the difference between an incoming request ID and that 
high-water mark is negative and small in magnitude (small so that counter 
wraparound isn’t mistaken for a replay), the message is discarded; a sketch 
follows below.
        c. Intentional replays must bump the request ID (e.g. retries due to 
timeout)

        pros: minimal additional header content, “perfect” filtering
        cons: requires a unique node ID, requires devices to cache (and 
expire) originator/ID pairs, and replays need an additional request ID bump.
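
As a strawman, the per-originator check in (b) might look roughly like this.  
Again this is illustrative only: the 32-bit request ID, the window size, and 
the names are all assumptions, and the lookup/expiry of the per-originator 
table is left out entirely.

    #include <stdint.h>
    #include <stdbool.h>

    /* "Small" negative difference that still counts as a replay; anything
     * further behind is assumed to be counter wraparound.  Arbitrary value. */
    #define SP_REPLAY_WINDOW  (1 << 20)

    /* highest_seen is the cached high-water mark for this originator.
     * Returns true if the message should be discarded as a loop/replay;
     * otherwise advances the high-water mark and returns false. */
    static bool sp_is_replay(uint32_t req_id, uint32_t *highest_seen)
    {
        int32_t diff = (int32_t)(req_id - *highest_seen);

        if (diff <= 0 && diff > -SP_REPLAY_WINDOW)
            return true;            /* at or just behind the mark: discard */

        *highest_seen = req_id;     /* newer (or wrapped): record and accept */
        return false;
    }

A real implementation would also have to age out originators that go quiet, 
which is exactly the cache/expiry cost noted in the cons above.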


APPROACH III.  (Can be combined with I & II)

        a. Every intermediate device scans the headers, and counts the number 
of hops. 
        b. If the hop count exceeds X (a configured limit), just discard the 
message; a sketch follows below.

        pros: easy to implement, mixes with the others, can be implemented 
with no wire protocol change
        cons: not complete - while it eliminates the worst of the tragedy, 
explosive expansion is still possible.
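
Roughly, the check amounts to the following.  The assumption here (just my 
reading, not settled) is that the existing backtrace is a stack of 32-bit 
pipe IDs with the final word flagged by its high bit, so a device can count 
hops without any wire change; the names and the option are hypothetical.

    #include <stdint.h>
    #include <stdbool.h>
    #include <stddef.h>

    /* Count the hop entries already pushed onto the backtrace and decide
     * whether to drop the message.  hop_limit would come from some socket
     * or device option (no such option exists today). */
    static bool sp_too_many_hops(const uint32_t *hdr, size_t nwords,
                                 unsigned hop_limit)
    {
        unsigned hops = 0;

        /* Each word pushed by an intermediate device is a pipe ID; the
         * final word (high bit set) belongs to the originator, not a hop. */
        for (size_t i = 0; i < nwords && !(hdr[i] & 0x80000000u); i++)
            hops++;

        return hops > hop_limit;   /* true => discard */
    }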

APPROACH IV.

        a. Every node keeps a record (perhaps a hash checksum?) of traffic 
seen “recently”, and discards duplicates.

        pros: perfect filtering, no protocol changes required
        cons: excessive amounts of processing, explosive memory requirements — 
this is just a bad idea

I’m interested to hear what folks think.

For now I’m going to just use Approach III and basically punt on the larger 
problem.  But I’d really favor a further change to improve our resilience to 
routing loops.

I haven’t looked to see what (if anything) ZMQ has done to solve this.

        - Garrett

