Actually, reading the RFC in more detail, I see that REQ/REP specifies Approach III with a limit of 8 hops. I'll probably simply apply the same setting in mangos. Furthermore, I intend to make this event user-visible, either via a statistic or via logging.
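
For concreteness, the device-side check I have in mind would look roughly like the following. This is an untested sketch, not actual mangos code, and it assumes the REQ/REP framing, where the header is a stack of big-endian 32-bit pipe IDs terminated by a request ID with its high bit set:

    package device

    import (
        "encoding/binary"
        "log"
        "sync/atomic"
    )

    const maxHops = 8 // the 8-hop limit the RFC specifies for REQ/REP

    var loopDrops uint64 // the user-visible statistic mentioned above

    // exceedsTTL reports whether the header records more than maxHops
    // hops, i.e. more than maxHops pipe IDs before the request ID.
    func exceedsTTL(header []byte) bool {
        hops := 0
        for len(header) >= 4 {
            if binary.BigEndian.Uint32(header)&0x80000000 != 0 {
                return false // reached the request ID within the limit
            }
            header = header[4:]
            hops++
            if hops > maxHops {
                return true
            }
        }
        return true // no terminating request ID: malformed, drop anyway
    }

    // dropIfLooping is what a device would call before forwarding.
    func dropIfLooping(header []byte) bool {
        if exceedsTTL(header) {
            atomic.AddUint64(&loopDrops, 1)
            log.Printf("dropping message: hop count exceeds %d", maxHops)
            return true
        }
        return false
    }

A device would run every message it forwards through dropIfLooping() and skip the ones that return true; the counter is what I'd surface as the statistic.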
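
The stateless Approach I check described in the quoted message below would be a similar linear scan of the trace. None of that encoding is settled, so this sketch picks the 96-bit hop record variant (a 32-bit pipe ID followed by the forwarding device's 64-bit node ID) purely for illustration:

    package device

    import "encoding/binary"

    // sawOurselves reports whether our own 64-bit node ID already
    // appears in the backtrace, meaning the message has looped back
    // through us. The trace ends at a 32-bit word with its high bit set.
    func sawOurselves(header []byte, self uint64) bool {
        for len(header) >= 4 {
            if binary.BigEndian.Uint32(header)&0x80000000 != 0 {
                return false // end of trace: no loop seen
            }
            if len(header) < 12 {
                return true // truncated hop record: drop as malformed
            }
            if binary.BigEndian.Uint64(header[4:12]) == self {
                return true // our ID is in the trace: routing loop
            }
            header = header[12:]
        }
        return true // no end-of-trace marker: malformed, drop
    }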
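
Approach II (also below) instead trades header growth for per-device state. The bookkeeping would be roughly this shape (all names invented, expiry policy elided):

    package device

    import (
        "sync"
        "time"
    )

    // originatorCache remembers the highest request ID seen from each
    // originator, so a device can discard looping or replayed messages.
    type originatorCache struct {
        mu   sync.Mutex
        last map[uint64]seenEntry
    }

    type seenEntry struct {
        reqID uint32
        when  time.Time // for expiring idle originators (not shown)
    }

    func newOriginatorCache() *originatorCache {
        return &originatorCache{last: make(map[uint64]seenEntry)}
    }

    // stale reports whether reqID is a duplicate of, or behind, the
    // highest request ID seen from origin. Unsigned subtraction handles
    // 32-bit wraparound: a difference of 2^31 or more means "a little
    // behind" (discard), anything smaller means genuinely newer.
    func (c *originatorCache) stale(origin uint64, reqID uint32) bool {
        c.mu.Lock()
        defer c.mu.Unlock()
        if e, ok := c.last[origin]; ok {
            diff := reqID - e.reqID
            if diff == 0 || diff >= 1<<31 {
                return true // duplicate or behind: discard the message
            }
        }
        c.last[origin] = seenEntry{reqID: reqID, when: time.Now()}
        return false
    }

A device would call stale() with the originator ID and request ID pulled from the header, and drop the message when it returns true.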
- Garrett

> On Feb 18, 2015, at 8:13 PM, Garrett D'Amore <garrett@xxxxxxxxxx> wrote:
>
> I'm working on updating the SURVEYOR protocol stuff, and I've realized we
> have a pretty nasty problem in SP.
>
> The SP protocols are all potentially subject to routing loops. There is no
> TTL in them, *and* there can be loops. With some of the topologies, it's
> possible to build tragically bad routing loops that are explosive in nature
> (think PUB/SUB, or worse, some of the BUS fabrics). The very worst of these
> is my STAR protocol, where exponential packet storms can arise.
>
> I think there are some fairly simple fixes here, but they will require some
> things that some may object to:
>
> 1. Wire protocol changes. I don't see how to fix these problems
> without fixing the broken wire protocol - *unless* we want to have
> intermediate nodes keep some kind of cache of every packet they've seen in
> the last time t. (5 seconds?) That seems tragically bad.
> 2. Expanded header sets - more detail below.
> 3. A little more time / logic in processing packets.
>
> I can see several different approaches to this problem. Here they are:
>
> APPROACH I.
>
> a. Assume every node has a unique ID of some sort. I'm going to
> suggest 64-bit EUIs for now (e.g. MAC addresses). In the worst case a
> random value can be used — the risk of collision is sufficiently small that
> I'm not concerned about it.
> b. Each hop along the path (device) adds a pipe ID *and* its own
> 64-bit ID to the backtrace. That means that each intermediate node adds
> 96, or more likely 128, bits. (I'd choose 128 over 96.) Frankly, if the
> pipe ID space weren't so constrained, we could use 64 bits by constraining
> pipe IDs to 16 bits and using 48-bit OUIs. Except we also need a bit to
> identify the end of the trace.
> c. When a node (device) sees its own ID in the backtrace, it discards
> the message. (Logging the presence of a routing loop might be good, or at
> least bumping a stat.)
>
> pros: perfect filtering, entirely stateless
> cons: substantial additional header sizes, and increased processing
> overhead across devices
>
>
> APPROACH II.
>
> a. Instead of *every* node attaching its own ID, the *originating*
> node could attach an originator ID next to the request ID, and the request
> ID must be monotonically increasing.
> b. Every node records the highest value it saw for each originator.
> If the difference is negative and small (to account for wraparound), then
> the message is discarded.
> c. Intentional replays must bump the request ID (e.g. retries due to
> timeout).
>
> pros: minimal additional header content, "perfect" filtering
> cons: requires unique node IDs, and requires devices to cache (and
> expire) originator/ID pairs; replays need an additional request ID bump
>
>
> APPROACH III. (Can be combined with I & II.)
>
> a. Every intermediate device scans the headers and counts the number
> of hops.
> b. If hop count > X (a configured limit), then just discard.
>
> pros: easy to implement, mixes with others, can be implemented with no
> wire protocol change
> cons: not complete - while it eliminates the worst tragedy, explosive
> expansion is still possible
>
> APPROACH IV.
>
> a. Every node keeps a record (perhaps a hash checksum?) of traffic
> seen "recently", and discards duplicates.
>
> pros: perfect filtering, no protocol changes required
> cons: excessive amounts of processing, explosive memory requirements —
> this is just a bad idea
>
> I'm interested to hear what folks think.
>
> For now I'm going to just use Approach III and basically punt on the rest.
> But I'm really favoring some other change to improve resilience to routing
> loops.
>
> I haven't looked to see what (if anything) ZMQ has done to solve this.
>
> - Garrett