I'm working on updating the SURVEYOR protocol, and I've realized we have a pretty nasty problem in SP. The SP protocols are all potentially subject to routing loops. There is no TTL in them, *and* there can be loops. With some of the topologies, it's possible to build tragically bad routing loops that are explosive in nature (think PUB/SUB, or worse, some of the BUS fabrics). The worst of these is my STAR protocol, where exponential packet storms can arise.

I think there are some fairly simple fixes here, but they will require some things that some may object to:

1. Wire protocol changes. I don't see how to fix these problems without fixing the broken wire protocol - *unless* we want to have intermediate nodes keep some kind of cache of every packet they've seen in the last time t (5 seconds?). That seems tragically bad.

2. Expanded header sets - more detail below.

3. A little more time / logic in processing packets.

I can see several different approaches to this problem. Here they are:

APPROACH I.

a. Assume every node has a unique ID of some sort. I'm going to suggest 64-bit EUIs for now (e.g. MAC addresses). In the worst case a random value can be used - the risk of collision is sufficiently small that I'm not concerned about it.

b. Each hop along the path (device) adds a pipe ID, *and* its own 64-bit ID, to the backtrace. That means that each intermediate node adds 96, or more likely 128, bits. (I'd choose 128 over 96.) Frankly, if the pipe ID space weren't so constrained, we could get down to 64 bits by constraining pipe IDs to 16 bits and using 48-bit OUIs - except we also need a bit to identify the end of the trace.

c. When a node (device) sees its own ID in the backtrace, it discards the message. (Logging the presence of a routing loop might be good, or at least bumping a stat.)

pros: perfect filtering, entirely stateless
cons: substantially larger headers, and increased processing overhead across devices

APPROACH II.

a.
Instead of *every* node attaching its own ID, the *originating* node could attach an originator ID next to the request ID, and the request ID must be monotonically increasing.

b. Every node records the highest value it has seen for each originator. If the difference is negative and small (to account for wraparound), then the message is discarded.

c. Intentional replays must bump the request ID (e.g. retries due to timeout).

pros: minimal additional header content, "perfect" filtering
cons: requires a unique node ID, requires devices to cache (and expire) originator/ID pairs, and replays need an additional request ID bump

APPROACH III. (Can be combined with I & II.)

a. Every intermediate device scans the headers and counts the number of hops.

b. If the hop count > X (a configured limit), then just discard.

pros: easy to implement, mixes with the others, can be implemented with no wire protocol change
cons: not complete - while it eliminates the worst tragedies, explosive expansions are still possible

APPROACH IV.

a. Every node keeps a record (perhaps a hash checksum?) of traffic seen "recently", and discards duplicates.

pros: perfect filtering, no protocol changes required
cons: excessive amounts of processing, explosive memory requirements - this is just a bad idea

I'm interested to hear what folks think. For now I'm going to just use Approach III and basically punt on it. But I'm really favoring some other change to improve resilience to routing loops. I haven't looked to see what (if anything) ZMQ has done to solve this.

- Garrett
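P.S. To make approaches I and II concrete, here's a rough sketch in C. All the names here are hypothetical (this is not actual SP wire-protocol code); it just shows the backtrace self-ID scan from Approach I and the wraparound-tolerant request-ID comparison from Approach II.

```c
#include <stddef.h>
#include <stdint.h>

/* --- Approach I: drop if our own 64-bit ID already appears in the
 *     backtrace accumulated by earlier hops. --- */
static int sp_in_backtrace(const uint64_t *trace, size_t nhops,
                           uint64_t self_id)
{
    size_t i;
    for (i = 0; i < nhops; i++)
        if (trace[i] == self_id)
            return 1;   /* routing loop detected */
    return 0;
}

/* --- Approach II: per-originator monotonically increasing request IDs.
 *     Serial-number arithmetic: the signed difference makes "negative
 *     and small" (stale / looped) come out <= 0 even across the 2^32
 *     wrap.  Relies on the usual two's-complement conversion. --- */
static int sp_req_is_newer(uint32_t last_seen, uint32_t incoming)
{
    return (int32_t)(incoming - last_seen) > 0;
}

/* One cache slot per originator; a real device would hash on the
 * originator's EUI-64 and expire idle entries after some timeout. */
struct sp_origin_entry {
    uint64_t origin_id;   /* originator's EUI-64 */
    uint32_t last_req;    /* highest request ID seen so far */
};

/* Decide whether to forward or discard, updating the cache slot. */
static int sp_should_forward(struct sp_origin_entry *e, uint32_t req)
{
    if (!sp_req_is_newer(e->last_req, req))
        return 0;         /* looped or stale copy: discard */
    e->last_req = req;
    return 1;
}
```

The signed-difference trick is the standard serial-number-arithmetic idiom (same idea as TCP sequence numbers / RFC 1982), which collapses the "difference is negative and small" wraparound test into a single comparison.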