Naive comment from me... in sock.c there is a comment

    /*  If nn_close() was already called there's no point in adjusting the
        snd/rcv file descriptors. */

but if threads are waiting on updates is it possible that this optimization is leading to a race condition?

https://github.com/wirebirdlabs/featherweight-nanomsg/blob/master/src/core/sock.c#L733-L738

I would try eliminating this optimization to see if it changes the behavior at all.

*** I have not tried this -- I am just looking through the code for why the race condition might exist ***

Hope that it helps.

    static void nn_sock_onleave (struct nn_ctx *self)
    {
        struct nn_sock *sock;
        int events;

        sock = nn_cont (self, struct nn_sock, ctx);

        /*  If nn_close() was already called there's no point in adjusting the
            snd/rcv file descriptors. */
        if (nn_slow (sock->state != NN_SOCK_STATE_ACTIVE))
            return;

On Sat, Jan 31, 2015 at 10:38 PM, Jack Dunaway <jack@xxxxxxxxxxxxxxxx> wrote:
> Jason, I have also experienced this hang in `nn_close()`, and so far as I
> can diagnose, it's a race condition that exists within both `nn_sock_send()`
> and `nn_sock_recv()` that manifests only when `nn_close()` is called
> concurrently from another thread.
>
> There exists a little bit of work to fix this issue you can casually peruse
> (don't take it too seriously, because it doesn't work yet!) here:
> https://github.com/wirebirdlabs/featherweight-nanomsg/commit/d0ebdeaf7d92e4c070aba599d6af871fe9808a5d
>
> I believe the race condition is this -- both the sock_recv and sock_send
> functions might loop indefinitely within the `while (1)` loop, yet within
> this loop, the context is released and captured again.
>
> I have also tried zombifying as part of `nn_close()` in an attempt to
> cleanly exit the blocking I/O function, but that did not seem to help. If
> anything, it allowed `nn_close()` to continue past the semaphore capture and
> free the socket, causing the blocking I/O to fault trying to access freed
> memory. Yikes!
>
> Hope this helps -- and let's keep bouncing around ideas,
> Jack R. Dunaway | Wirebird Labs LLC
>
> On Sat, Jan 31, 2015 at 9:19 PM, Jason E. Aten <j.e.aten@xxxxxxxxx> wrote:
>>
>> Update: I got a stack trace from gdb. It appears to be hung in
>> nn_sem_wait(), at src/utils/sem.c:159, which is a call:
>>
>>     rc = sem_wait (&self->sem); // src/utils/sem.c:159 hangs here.
>>
>> So my earlier diagnosis was likely incorrect. It seems we have a logic bug
>> instead.
>>
>> (gdb) bt
>> #0  sem_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/sem_wait.S:86
>> #1  0x00007ffff7dd0eeb in nn_sem_wait (self=self@entry=0x7fffb4017a88) at src/utils/sem.c:159
>> #2  0x00007ffff7dca6c2 in nn_sock_term (self=0x7fffb40179b0) at src/core/sock.c:202
>> #3  0x00007ffff7dc7837 in nn_close (s=31) at src/core/global.c:574
>> #4  0x0000000000401d7b in _cgo_14c45440a8bc_C2func_nn_close (v=0xc2094182a0)
>>     at /home/jaten/go/src/github.com/glycerine/go-nanomsg/nanomsg.go:61
>> #5  0x0000000000489ca5 in asmcgocall () at /home/jaten/pkg/go1.4.1/go/src/runtime/asm_amd64.s:665
>> #6  0x0000000000000008 in ?? ()
>> #7  0x000000c20913e000 in ?? ()
>> #8  0x000000000044e749 in runtime.cgocall_errno (fn=0x0, arg=0x0, ~r2=4204019)
>>     at /home/jaten/pkg/go1.4.1/go/src/runtime/cgocall.go:117
>> #9  0x000000000047e804 in runtime.mstart () at /home/jaten/pkg/go1.4.1/go/src/runtime/proc.c:836
>> #10 0x00000000004025f3 in crosscall_amd64 () at /home/jaten/pkg/go1.4.1/go/src/runtime/cgo/gcc_amd64.S:35
>> #11 0x0000000000000003 in ?? ()
>> #12 0x0000000000000000 in ?? ()
>> (gdb)
>>
>> On Sat, Jan 31, 2015 at 6:38 PM, Jason E. Aten <j.e.aten@xxxxxxxxx> wrote:
>>>
>>> In my application, this doesn't happen for a while, but then after a
>>> while, the server doing an nn_close() on a nanomsg socket hangs forever.
>>>
>>> I read in the close(2) man page:
>>>
>>>     When dealing with sockets, you have to be sure that there is no
>>>     recv(2) still blocking on it on another thread, otherwise it might
>>>     block forever, since no more messages will be sent via the socket.
>>>     Be sure to use shutdown(2) to shut down all parts the connection
>>>     before closing the socket.
>>>
>>> Moreover I see this example discussion [the answer by Joseph Quinsey] of
>>> how to properly close a socket:
>>>
>>> http://stackoverflow.com/questions/12730477/close-is-not-closing-socket-properly
>>>
>>> Mr. Quinsey suggests that there are three (3) steps needed to
>>> successfully close without hanging:
>>>
>>> a) getsockopt(fd, SOL_SOCKET, SO_ERROR, (char *)&err, &len); // to clear
>>>    any error on the socket
>>>
>>> b) shutdown(fd, SHUT_RDWR); // to terminate reliable delivery
>>>
>>> c) close(fd); // finally
>>>
>>> I don't see nanomsg doing a) or b), so I tend to think this is a bug in
>>> the nn_close() implementation, and these two steps should be added.
>>>
>>> Thoughts?
>>>
>>> Thanks!
>>>
>>> - Jason
>>
>> --
>> Best regards,
>> Jason
>>
>> --
>> Jason E. Aten, Ph.D.
>> j.e.aten@xxxxxxxxx
>> 650-429-8602
>> linkedin: https://www.linkedin.com/pub/jason-e-aten-ph-d/18/313/45a
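
For reference, steps a) through c) from the original message could be sketched roughly as below for an ordinary POSIX socket descriptor. This is only an illustration of the quoted sequence: the helper name close_socket_cleanly is hypothetical and not part of nanomsg, it operates on a raw fd rather than a nanomsg socket handle, and since the gdb trace in this thread shows the hang in nn_sem_wait(), it should not be read as a fix for nn_close().

```c
#include <assert.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper (not a nanomsg function): applies the three quoted
   steps to a plain socket descriptor. Returns the result of close(). */
static int close_socket_cleanly (int fd)
{
    int err = 0;
    socklen_t len = sizeof (err);

    assert (fd >= 0);

    /* a) Fetch SO_ERROR; reading it also clears any pending socket error. */
    (void) getsockopt (fd, SOL_SOCKET, SO_ERROR, &err, &len);

    /* b) Shut down both directions. Failures (for example ENOTCONN on a
          socket that was never connected) are ignored in this sketch. */
    (void) shutdown (fd, SHUT_RDWR);

    /* c) Finally release the descriptor. */
    return close (fd);
}
```

Something like close_socket_cleanly (fd) would replace a bare close (fd) in code that owns the descriptor; whether shutdown() actually wakes a recv() blocked in another thread depends on the platform.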