Shixi, is that the line your stack trace is originating from?

https://github.com/nanomsg/nanomsg/blob/master/src/transports/tcp/ctcp.c#L619

It does seem problematic to me that there is an errnum_assert() here.

Martin/everyone: it would seem reasonable to return an error instead of aborting here, no?

Thanks.

Jason

On Sat, Nov 15, 2014 at 7:28 AM, xreborner (Shixi Chen) <xreborner@xxxxxxxxx> wrote:

> The problem is that, if it fails, it just aborts. So no chance to retry.
>
> On Sat, Nov 15, 2014 at 4:21 PM, Jason E. Aten <j.e.aten@xxxxxxxxx> wrote:
>
>> This.
>>
>> Echoing Matt's comment -- I just bind to port 0 using non-nanomsg socket
>> calls (so the kernel picks a free port). Then I note the port, then close
>> that socket and reopen on that port in nanomsg. Since there is still a
>> short period in which that port might get taken, I also retry if that
>> fails, but usually it succeeds.
>>
>> On Fri, Nov 14, 2014 at 4:51 PM, Matt Howlett <matt.howlett@xxxxxxxxx>
>> wrote:
>>
>>> The behavior of nn_bind was also unexpected to me. My work-around is to
>>> find a free port outside of nanomsg, then immediately bind to it. I stagger
>>> the start up of my workers (~1 worker per core per machine) so in practice
>>> I never get a race condition. Not ideal, but it works. If you can control
>>> when all of the processes that bind to random ports start up on each node,
>>> you can do the same thing, though it sounds like your situation might be
>>> more difficult.
>>>
>>> On Fri, Nov 14, 2014 at 10:30 PM, xreborner (Shixi Chen)
>>> <xreborner@xxxxxxxxx> wrote:
>>>
>>>> Unfortunately, I'm running a distributed computation application on a
>>>> cluster with thousands of machines; on each machine there could be multiple
>>>> tasks running in the background that have occupied some random ports. If I
>>>> just choose a random port (in fact, I use more than one port) and use it,
>>>> there is roughly a 1% probability of failure on one machine. If my application
>>>> is running on 200 machines, then it almost always fails.
>>>>
>>>> On Fri, Nov 14, 2014 at 11:05 PM, Martin Sustrik <sustrik@xxxxxxxxxx>
>>>> wrote:
>>>>
>>>>> -----BEGIN PGP SIGNED MESSAGE-----
>>>>> Hash: SHA1
>>>>>
>>>>> On 14/11/14 15:55, xreborner (Shixi Chen) wrote:
>>>>> > So, if I don't know any port numbers that are available, it is
>>>>> > impossible to use nanomsg?
>>>>>
>>>>> Yes. Although on a typical machine, almost all ports are unused, so
>>>>> just picking one and using it tends to work.
>>>>>
>>>>> > My program is to be run on a remote cluster, where no port numbers
>>>>> > are known to be reserved. I was using zeromq and my solution was to
>>>>> > repeat calling zmq_bind. I'm considering switching to nanomsg since
>>>>> > it looks better (and also due to some problems in zeromq). Is there
>>>>> > any plan to solve this problem in the future?
>>>>>
>>>>> I was thinking of implementing tcpmux (RFC 1078) but it's not coming any
>>>>> time soon.
>>>>>
>>>>> Martin
>>>>> -----BEGIN PGP SIGNATURE-----
>>>>> Version: GnuPG v1.4.11 (GNU/Linux)
>>>>>
>>>>> iQEcBAEBAgAGBQJUZho9AAoJENTpVjxCNN9YouMH/RI/d9AismH7RuEH7aY6oOQV
>>>>> snl5ad/wZsupguf5uGtYfomnJOMtMrwLo+qEHK+u5JCWmBN73VikfJuJtwZs/lsg
>>>>> umD1xt6tGvOyxmI1V1bzXkNASyUktPpjedA0xgbBXlw8KwsDTTKIRaVCwNQt+FND
>>>>> tKKMHIQKJ9B0qmD8UrlT8fg1qwLsG/HUgr1JrkVw1+yLnaGXzwCdxWO49F3X+dEl
>>>>> aXwIO1cZrcpB+hPb7lemn4pWQDa//JiIbE4wbg7aT4ecgIWFd4UheHQfSBr8ZniH
>>>>> XjeGlJcJ4IDos9DzfNTKgj07lgGoMB+lt/7M+qr+Mh4AjJZTgYM11nGyp0ljpQg=
>>>>> =jgr1
>>>>> -----END PGP SIGNATURE-----
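
For anyone finding this thread later, the port-0 workaround Jason and Matt describe could be sketched roughly as below. This is only an illustration using plain POSIX sockets, not code from nanomsg or from anyone in the thread; the helper names pick_free_port() and format_tcp_addr() are made up for this example, and the "tcp://*:PORT" string is assumed to be the address form handed to nn_bind().

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>

/* Hypothetical helper: ask the kernel for a currently free TCP port by
   binding an ordinary socket to port 0. Returns the port, or -1 on error.
   The socket is closed before returning, so another process can still
   take the port before the caller binds to it. */
int pick_free_port(void)
{
    struct sockaddr_in addr;
    socklen_t len = sizeof(addr);
    int port;
    int fd = socket(AF_INET, SOCK_STREAM, 0);

    if (fd < 0)
        return -1;

    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(0);            /* 0 = let the kernel choose */

    if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
        getsockname(fd, (struct sockaddr *)&addr, &len) < 0) {
        close(fd);
        return -1;
    }

    port = ntohs(addr.sin_port);         /* note the port the kernel picked */
    close(fd);                           /* release it for the real bind */
    return port;
}

/* Hypothetical helper: build the address string for the chosen port.
   Returns 0 on success, -1 if the buffer is too small. */
int format_tcp_addr(char *buf, size_t size, int port)
{
    int n;

    assert(buf != NULL && size > 0);
    n = snprintf(buf, size, "tcp://*:%d", port);
    return (n > 0 && (size_t)n < size) ? 0 : -1;
}
```

A caller would then loop: call pick_free_port(), build the address with format_tcp_addr(), pass it to nn_bind(), and try again with a fresh port if the bind fails. As Shixi points out, that retry only helps if the bind reports an error instead of aborting.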