[nanomsg] Re: nn_close() of nanomsg socket hangs forever

  • From: George Lambert <marchon@xxxxxxxxx>
  • To: nanomsg@xxxxxxxxxxxxx
  • Date: Sun, 1 Feb 2015 12:23:08 -0500

Naive comment from me... in sock.c there is a comment

    /*  If nn_close() was already called there's no point in adjusting
        the snd/rcv file descriptors. */

but if threads are waiting on updates is it possible that this
optimization is leading to a race condition?


https://github.com/wirebirdlabs/featherweight-nanomsg/blob/master/src/core/sock.c#L733-L738


I would try eliminating this optimization to see if it changes the
behavior at all.

*** I have not tried this -- I am just looking through the code for
why the race condition might exist ***

Hope that it helps.


static void nn_sock_onleave (struct nn_ctx *self)
{
    struct nn_sock *sock;
    int events;

    sock = nn_cont (self, struct nn_sock, ctx);

    /*  If nn_close() was already called there's no point in adjusting the
        snd/rcv file descriptors. */
    if (nn_slow (sock->state != NN_SOCK_STATE_ACTIVE))
        return;
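
The kind of race discussed in this thread (one thread blocked in a receive loop while another thread closes the socket) can be illustrated with a toy model. This is only a sketch under stated assumptions: `toy_sock`, `toy_recv`, `toy_close` and `toy_demo` are hypothetical names built on plain pthreads, and none of it is nanomsg code or a proposed patch. The receiver re-checks a `closing` flag on every wakeup, and the closer wakes all waiters and then waits until they have left before returning.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical toy socket; none of these names come from nanomsg. */
struct toy_sock {
    pthread_mutex_t lock;
    pthread_cond_t cond;
    bool closing;   /* set once toy_close() has been called */
    bool has_msg;   /* true when a message is ready to be received */
    int waiters;    /* threads currently inside toy_recv() */
};

static void toy_init (struct toy_sock *s)
{
    pthread_mutex_init (&s->lock, NULL);
    pthread_cond_init (&s->cond, NULL);
    s->closing = false;
    s->has_msg = false;
    s->waiters = 0;
}

/* Blocks until a message arrives or the socket is being closed.
   Returns 0 if a message was taken, -1 if the socket was closed. */
static int toy_recv (struct toy_sock *s)
{
    int rc;

    pthread_mutex_lock (&s->lock);
    s->waiters++;
    /* Re-check 'closing' on every wakeup so the loop can always exit. */
    while (!s->has_msg && !s->closing)
        pthread_cond_wait (&s->cond, &s->lock);
    if (s->has_msg) {
        s->has_msg = false;
        rc = 0;
    } else {
        rc = -1;
    }
    s->waiters--;
    pthread_cond_broadcast (&s->cond);  /* let toy_close() see the count drop */
    pthread_mutex_unlock (&s->lock);
    return rc;
}

/* Marks the socket as closing, wakes blocked receivers, and returns only
   after every receiver has left toy_recv(). Only then would it be safe
   for a real implementation to free the socket. */
static void toy_close (struct toy_sock *s)
{
    pthread_mutex_lock (&s->lock);
    s->closing = true;
    pthread_cond_broadcast (&s->cond);
    while (s->waiters > 0)
        pthread_cond_wait (&s->cond, &s->lock);
    pthread_mutex_unlock (&s->lock);
}

struct recv_job {
    struct toy_sock *sock;
    int rc;
};

static void *recv_thread (void *arg)
{
    struct recv_job *job = arg;
    job->rc = toy_recv (job->sock);
    return NULL;
}

/* Starts a receiver with no message pending, closes the socket from the
   calling thread, and returns the receiver's result. */
static int toy_demo (void)
{
    struct toy_sock s;
    struct recv_job job = { &s, 0 };
    pthread_t t;
    int rc;

    toy_init (&s);
    rc = pthread_create (&t, NULL, recv_thread, &job);
    assert (rc == 0);
    toy_close (&s);
    pthread_join (t, NULL);
    return job.rc;
}
```

In this model `toy_demo ()` returns -1 whether the receiver blocks before or after `toy_close ()` runs, because the flag is checked under the same lock that protects the wait. Whether the same ordering can be applied inside `nn_sock_recv()` and `nn_sock_term()` is a separate question that the sketch does not answer.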

On Sat, Jan 31, 2015 at 10:38 PM, Jack Dunaway <jack@xxxxxxxxxxxxxxxx> wrote:
> Jason, I have also experienced this hang in `nn_close()`, and so far as I
> can diagnose, it's a race condition that exists within both `nn_sock_send()`
> and `nn_sock_recv()` that manifests only when `nn_close()` is called
> concurrently from another thread.
>
> There is some work in progress toward a fix that you can casually peruse
> (don't take it too seriously, because it doesn't work yet!) here:
> https://github.com/wirebirdlabs/featherweight-nanomsg/commit/d0ebdeaf7d92e4c070aba599d6af871fe9808a5d
>
> I believe the race condition is this -- both the sock_recv and sock_send
> functions might loop indefinitely within the `while (1)` loop, yet within
> this loop, the context is released and captured again.
>
> I have also tried zombifying as part of `nn_close()` in an attempt to
> cleanly exit the blocking I/O function, but that did not seem to help. If
> anything, it allowed `nn_close()` to continue past the semaphore capture and
> free the socket, causing the blocking I/O to fault trying to access freed
> memory. Yikes!
>
> Hope this helps -- and let's keep bouncing around ideas,
> Jack R. Dunaway | Wirebird Labs LLC
>
>
>
>
> On Sat, Jan 31, 2015 at 9:19 PM, Jason E. Aten <j.e.aten@xxxxxxxxx> wrote:
>>
>> Update: I got a stack trace from gdb. It appears to be hung in
>> nn_sem_wait(), at src/utils/sem.c:159, which is a call:
>>
>> rc = sem_wait (&self->sem); // src/utils/sem.c:159 hangs here.
>>
>>
>> So my earlier diagnosis was likely incorrect. It seems we have a logic bug
>> instead.
>>
>> (gdb) bt
>> #0  sem_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/sem_wait.S:86
>> #1  0x00007ffff7dd0eeb in nn_sem_wait (self=self@entry=0x7fffb4017a88) at src/utils/sem.c:159
>> #2  0x00007ffff7dca6c2 in nn_sock_term (self=0x7fffb40179b0) at src/core/sock.c:202
>> #3  0x00007ffff7dc7837 in nn_close (s=31) at src/core/global.c:574
>> #4  0x0000000000401d7b in _cgo_14c45440a8bc_C2func_nn_close (v=0xc2094182a0)
>>     at /home/jaten/go/src/github.com/glycerine/go-nanomsg/nanomsg.go:61
>> #5  0x0000000000489ca5 in asmcgocall () at /home/jaten/pkg/go1.4.1/go/src/runtime/asm_amd64.s:665
>> #6  0x0000000000000008 in ?? ()
>> #7  0x000000c20913e000 in ?? ()
>> #8  0x000000000044e749 in runtime.cgocall_errno (fn=0x0, arg=0x0, ~r2=4204019)
>>     at /home/jaten/pkg/go1.4.1/go/src/runtime/cgocall.go:117
>> #9  0x000000000047e804 in runtime.mstart () at /home/jaten/pkg/go1.4.1/go/src/runtime/proc.c:836
>> #10 0x00000000004025f3 in crosscall_amd64 () at /home/jaten/pkg/go1.4.1/go/src/runtime/cgo/gcc_amd64.S:35
>> #11 0x0000000000000003 in ?? ()
>> #12 0x0000000000000000 in ?? ()
>> (gdb)
>>
>>
>> On Sat, Jan 31, 2015 at 6:38 PM, Jason E. Aten <j.e.aten@xxxxxxxxx> wrote:
>>>
>>> In my application, everything works for a while, but eventually the
>>> server hangs forever in nn_close() on a nanomsg socket.
>>>
>>> I read in the close(2) man page:
>>>
>>>        When dealing with sockets, you have to be sure that there is no
>>>        recv(2) still blocking on it on another thread, otherwise it might
>>>        block forever, since no more messages will be sent via the socket.
>>>        Be sure to use shutdown(2) to shut down all parts the connection
>>>        before closing the socket.
>>>
>>>
>>> Moreover I see this example discussion [the answer by Joseph Quinsey] of
>>> how to properly close a socket:
>>>
>>>
>>> http://stackoverflow.com/questions/12730477/close-is-not-closing-socket-properly
>>>
>>> Mr. Quinsey suggests that there are three (3) steps needed to
>>> successfully close without hanging:
>>>
>>> a) getsockopt(fd, SOL_SOCKET, SO_ERROR, (char *)&err, &len); // to clear any error on the socket
>>>
>>> b) shutdown(fd, SHUT_RDWR); // to terminate reliable delivery
>>>
>>> c) close(fd); // finally
>>>
>>>
>>> I don't see nanomsg doing a) or b), so I tend to think this is a bug in
>>> the nn_close() implementation, and these two steps should be added.
>>>
>>> Thoughts?
>>>
>>>
>>> Thanks!
>>>
>>> - Jason
>>
>>
>>
>>
>> --
>>
>> Best regards,
>> Jason
>>
>> --
>> Jason E. Aten, Ph.D.
>> j.e.aten@xxxxxxxxx
>> 650-429-8602
>> linkedin: https://www.linkedin.com/pub/jason-e-aten-ph-d/18/313/45a
>
>
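
The three close steps quoted in Jason's original message (read SO_ERROR, shutdown, close) can be sketched for a plain POSIX socket as follows. This is a minimal sketch, not nanomsg code: `close_socket_gracefully` and `three_step_demo` are hypothetical names, the descriptor is an ordinary OS socket rather than an SP socket handle as passed to nn_close(), and the thread itself leaves open whether these steps relate to the hang at all.

```c
#include <assert.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper, not part of nanomsg or of the quoted answer.
   Returns 0 on success, -1 if close() fails. */
static int close_socket_gracefully (int fd)
{
    int err = 0;
    socklen_t len = sizeof (err);

    assert (fd >= 0);

    /* a) Reading SO_ERROR also clears any pending error on the socket. */
    (void) getsockopt (fd, SOL_SOCKET, SO_ERROR, &err, &len);

    /* b) Shut down both directions. A failure here (for example ENOTCONN
          on an unconnected socket) is ignored so that close() still runs. */
    (void) shutdown (fd, SHUT_RDWR);

    /* c) Finally release the descriptor. */
    return close (fd);
}

/* Closes one end of a connected socket pair with the helper above and
   returns what recv() reports on the other end. */
static int three_step_demo (void)
{
    int sv[2];
    char byte;
    ssize_t n;

    if (socketpair (AF_UNIX, SOCK_STREAM, 0, sv) < 0)
        return -1;
    if (close_socket_gracefully (sv[0]) < 0) {
        close (sv[1]);
        return -1;
    }
    n = recv (sv[1], &byte, 1, 0);  /* peer is gone, so this reports EOF */
    close (sv[1]);
    return (int) n;
}
```

With a stream socket pair, the surviving end sees end-of-file (a return value of 0 from `recv`) once the other end has been shut down and closed, so a reader blocked there would be released instead of waiting forever.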


