[nanomsg] Re: nn_close() of nanomsg socket hangs forever

  • From: Jack Dunaway <jack@xxxxxxxxxxxxxxxx>
  • To: nanomsg@xxxxxxxxxxxxx
  • Date: Sat, 31 Jan 2015 21:38:12 -0600

Jason, I have also experienced this hang in `nn_close()`, and so far as I
can diagnose, it's a race condition that exists within both
`nn_sock_send()` and `nn_sock_recv()` that manifests only when `nn_close()`
is called concurrently from another thread.

There is some work in progress toward a fix that you can casually peruse
(don't take it too seriously, because it doesn't work yet!) here:
https://github.com/wirebirdlabs/featherweight-nanomsg/commit/d0ebdeaf7d92e4c070aba599d6af871fe9808a5d

I believe the race condition is this -- both `nn_sock_recv()` and
`nn_sock_send()` might loop indefinitely within their `while (1)` loop,
and on each pass through the loop the context is released and then
captured again, which leaves a window for a concurrent `nn_close()`.

I have also tried zombifying as part of `nn_close()` in an attempt to
cleanly exit the blocking I/O function, but that did not seem to help. If
anything, it allowed `nn_close()` to continue past the semaphore capture
and free the socket, causing the blocking I/O to fault trying to access
freed memory. Yikes!
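
For what it's worth, here is a minimal sketch of the kind of handshake I
think a fix needs, written against plain pthreads rather than nanomsg
internals. Everything in it (`demo_sock`, `demo_recv()`, `demo_post()`,
`demo_close()`, the `inflight` counter, the `EBADF` return) is a
hypothetical stand-in and an assumption on my part, not nanomsg's actual
code. The idea is that close wakes any blocked caller and then waits for
it to leave before freeing anything:

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for a socket; NOT nanomsg's struct nn_sock. */
struct demo_sock {
    pthread_mutex_t lock;    /* plays the role of the socket context */
    pthread_cond_t changed;  /* signalled on every state change */
    int closing;             /* set once by demo_close() */
    int inflight;            /* threads currently inside demo_recv() */
    int ready;               /* messages waiting to be received */
};

struct demo_sock *demo_open(void)
{
    struct demo_sock *s = calloc(1, sizeof *s);
    if (s == NULL)
        return NULL;
    pthread_mutex_init(&s->lock, NULL);
    pthread_cond_init(&s->changed, NULL);
    return s;
}

/* Make one message available and wake any blocked receiver. */
void demo_post(struct demo_sock *s)
{
    pthread_mutex_lock(&s->lock);
    s->ready++;
    pthread_cond_broadcast(&s->changed);
    pthread_mutex_unlock(&s->lock);
}

/* Blocks until a message is ready or the socket is closing.
   Returns 0 on success, EBADF if the socket was closed meanwhile. */
int demo_recv(struct demo_sock *s)
{
    int rc;
    pthread_mutex_lock(&s->lock);
    s->inflight++;
    pthread_cond_broadcast(&s->changed);
    /* pthread_cond_wait() drops the lock while sleeping and retakes it
       before returning, so the state is re-checked on every pass. */
    while (!s->ready && !s->closing)
        pthread_cond_wait(&s->changed, &s->lock);
    if (s->closing) {
        rc = EBADF;
    } else {
        s->ready--;
        rc = 0;
    }
    s->inflight--;
    pthread_cond_broadcast(&s->changed);  /* let demo_close() proceed */
    pthread_mutex_unlock(&s->lock);
    return rc;
}

/* Wakes blocked receivers, waits for them to leave, then frees. */
void demo_close(struct demo_sock *s)
{
    pthread_mutex_lock(&s->lock);
    s->closing = 1;
    pthread_cond_broadcast(&s->changed);
    while (s->inflight > 0)
        pthread_cond_wait(&s->changed, &s->lock);
    pthread_mutex_unlock(&s->lock);
    pthread_cond_destroy(&s->changed);
    pthread_mutex_destroy(&s->lock);
    free(s);
}

static void *receiver(void *arg)
{
    /* Pass the demo_recv() result back through the thread exit value. */
    return (void *)(intptr_t)demo_recv(arg);
}

/* Closes the socket while another thread is blocked in demo_recv() and
   returns that thread's result code (or -1 if setup failed). */
int demo_close_while_blocked(void)
{
    struct demo_sock *s = demo_open();
    pthread_t t;
    void *res = NULL;
    if (s == NULL)
        return -1;
    if (pthread_create(&t, NULL, receiver, s) != 0) {
        demo_close(s);
        return -1;
    }
    /* Wait until the receiver has registered itself as in flight. */
    pthread_mutex_lock(&s->lock);
    while (s->inflight == 0)
        pthread_cond_wait(&s->changed, &s->lock);
    pthread_mutex_unlock(&s->lock);
    demo_close(s);
    pthread_join(t, &res);
    return (int)(intptr_t)res;
}
```

Calling `demo_close_while_blocked()` from a `main()` (built with
`-pthread`) exercises the close-during-recv case. Note that the sketch
only protects callers that are already inside `demo_recv()` when close
starts; a thread that calls in after the free would still fault, so a real
fix would also need to guard the lookup of the socket itself.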

Hope this helps -- and let's keep bouncing around ideas,
Jack R. Dunaway | Wirebird Labs LLC



On Sat, Jan 31, 2015 at 9:19 PM, Jason E. Aten <j.e.aten@xxxxxxxxx> wrote:

> Update: I got a stack trace from gdb. It appears to be hung in
> nn_sem_wait(), at src/utils/sem.c:159, which is a call:
>
> rc = sem_wait (&self->sem); // src/utils/sem.c:159 hangs here.
>
> So my earlier diagnosis was likely incorrect. It seems we have a logic bug
> instead.
>
> (gdb) bt
> #0  sem_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/sem_wait.S:86
> #1  0x00007ffff7dd0eeb in nn_sem_wait (self=self@entry=0x7fffb4017a88) at src/utils/sem.c:159
> #2  0x00007ffff7dca6c2 in nn_sock_term (self=0x7fffb40179b0) at src/core/sock.c:202
> #3  0x00007ffff7dc7837 in nn_close (s=31) at src/core/global.c:574
> #4  0x0000000000401d7b in _cgo_14c45440a8bc_C2func_nn_close (v=0xc2094182a0)
>     at /home/jaten/go/src/github.com/glycerine/go-nanomsg/nanomsg.go:61
> #5  0x0000000000489ca5 in asmcgocall () at /home/jaten/pkg/go1.4.1/go/src/runtime/asm_amd64.s:665
> #6  0x0000000000000008 in ?? ()
> #7  0x000000c20913e000 in ?? ()
> #8  0x000000000044e749 in runtime.cgocall_errno (fn=0x0, arg=0x0, ~r2=4204019)
>     at /home/jaten/pkg/go1.4.1/go/src/runtime/cgocall.go:117
> #9  0x000000000047e804 in runtime.mstart () at /home/jaten/pkg/go1.4.1/go/src/runtime/proc.c:836
> #10 0x00000000004025f3 in crosscall_amd64 () at /home/jaten/pkg/go1.4.1/go/src/runtime/cgo/gcc_amd64.S:35
> #11 0x0000000000000003 in ?? ()
> #12 0x0000000000000000 in ?? ()
> (gdb)
>
> On Sat, Jan 31, 2015 at 6:38 PM, Jason E. Aten <j.e.aten@xxxxxxxxx> wrote:
>
>> In my application, everything works for a while, but eventually the
>> server's nn_close() call on a nanomsg socket hangs forever.
>>
>> I read in the close(2) man page:
>>
>>        When dealing with sockets, you have to be sure that there is no
>>        recv(2) still blocking on it on another thread, otherwise it might
>>        block forever, since no more messages will be sent via the socket.
>>        Be sure to use shutdown(2) to shut down all parts the connection
>>        before closing the socket.
>>
>>
>> Moreover I see this example discussion [the answer by Joseph Quinsey
>> <http://stackoverflow.com/users/318716/joseph-quinsey>] of how to
>> properly close a socket:
>>
>>
>> http://stackoverflow.com/questions/12730477/close-is-not-closing-socket-properly
>>
>> Mr. Quinsey suggests that there are three (3) steps needed to
>> successfully close without hanging:
>>
>> a) getsockopt(fd, SOL_SOCKET, SO_ERROR, (char *)&err, &len); // to clear
>> any error on the socket
>>
>> b) shutdown(fd, SHUT_RDWR); // to terminate reliable delivery
>>
>> c) close(fd); // finally
>>
>>
>> I don't see nanomsg doing a) or b), so I tend to think this is a bug in
>> the nn_close() implementation, and these two steps should be added.
>>
>> Thoughts?
>>
>>
>> Thanks!
>>
>> - Jason
>>
>
>
>
> --
>
> Best regards,
> Jason
>
> --
> Jason E. Aten, Ph.D.
> j.e.aten@xxxxxxxxx
> 650-429-8602
> linkedin: https://www.linkedin.com/pub/jason-e-aten-ph-d/18/313/45a
>
