[nanomsg] Re: nn_close() of nanomsg socket hangs forever

From: George Lambert <marchon@xxxxxxxxx>
To: nanomsg@xxxxxxxxxxxxx
Date: Sun, 1 Feb 2015 12:23:08 -0500
Naive comment from me... in sock.c there is a comment

    /*  If nn_close() was already called there's no point in adjusting
the snd/rcv file descriptors. */

but if threads are waiting on updates is it possible that this
optimization is leading to a race condition?


https://github.com/wirebirdlabs/featherweight-nanomsg/blob/master/src/core/sock.c#L733-L738


I would try eliminating this optimization to see if it changes the
behavior at all.

*** I have not tried this -- I am just looking through the code for
why the race condition might exist ***

Hope that it helps.


static void nn_sock_onleave (struct nn_ctx *self)
{
    struct nn_sock *sock;
    int events;

    sock = nn_cont (self, struct nn_sock, ctx);

    /*  If nn_close() was already called there's no point in adjusting the
        snd/rcv file descriptors. */
    if (nn_slow (sock->state != NN_SOCK_STATE_ACTIVE))
        return;
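
To make the suspected hazard concrete, here is a minimal sketch of the
general pattern. It is a toy model built on pthreads, not nanomsg code:
the toy_* names are hypothetical, and the condition variable only stands
in for the snd/rcv file descriptors. The point is that close must wake
blocked waiters. If toy_close() skipped the broadcast because "the socket
is closing anyway", a thread already asleep in toy_recv() would never
return.

```c
#include <pthread.h>
#include <stdint.h>

/* Hypothetical stand-in for a socket; NOT nanomsg's struct nn_sock. */
enum toy_state { TOY_ACTIVE, TOY_CLOSING };

struct toy_sock {
    pthread_mutex_t lock;
    pthread_cond_t readable;   /* plays the role of the rcv descriptor */
    enum toy_state state;
    int pending;               /* number of messages ready to receive */
};

static void toy_init (struct toy_sock *s)
{
    pthread_mutex_init (&s->lock, NULL);
    pthread_cond_init (&s->readable, NULL);
    s->state = TOY_ACTIVE;
    s->pending = 0;
}

/* Blocking receive. Returns 0 if a message was taken, -1 if the socket
   is closing and nothing is pending. */
static int toy_recv (struct toy_sock *s)
{
    int rc;

    pthread_mutex_lock (&s->lock);
    /* Re-check both conditions after every wakeup. */
    while (s->pending == 0 && s->state == TOY_ACTIVE)
        pthread_cond_wait (&s->readable, &s->lock);
    if (s->pending > 0) {
        --s->pending;
        rc = 0;
    } else {
        rc = -1;
    }
    pthread_mutex_unlock (&s->lock);
    return rc;
}

/* Close: change the state and ALWAYS notify waiters. Omitting the
   broadcast is the analogue of the early return under discussion. */
static void toy_close (struct toy_sock *s)
{
    pthread_mutex_lock (&s->lock);
    s->state = TOY_CLOSING;
    pthread_cond_broadcast (&s->readable);
    pthread_mutex_unlock (&s->lock);
}

/* Thread entry point: block in toy_recv() and hand back its result. */
static void *toy_waiter (void *arg)
{
    return (void *) (intptr_t) toy_recv ((struct toy_sock *) arg);
}
```

Starting toy_waiter() in one thread and calling toy_close() from another
lets the waiter return -1 instead of blocking forever. Whether the early
return in nn_sock_onleave() really has the same effect is exactly what
would need to be tested.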

On Sat, Jan 31, 2015 at 10:38 PM, Jack Dunaway <jack@xxxxxxxxxxxxxxxx> wrote:
> Jason, I have also experienced this hang in `nn_close()`, and so far as I
> can diagnose, it's a race condition that exists within both `nn_sock_send()`
> and `nn_sock_recv()` that manifests only when `nn_close()` is called
> concurrently from another thread.
>
> There exists a little bit of work to fix this issue you can casually peruse
> (don't take it too seriously, because it doesn't work yet!) here:
> https://github.com/wirebirdlabs/featherweight-nanomsg/commit/d0ebdeaf7d92e4c070aba599d6af871fe9808a5d
>
> I believe the race condition is this -- both the sock_recv and sock_send
> functions might loop indefinitely within the `while (1)` loop, yet within
> this loop, the context is released and captured again.
>
> I have also tried zombifying as part of `nn_close()` in an attempt to
> cleanly exit the blocking I/O function, but that did not seem to help. If
> anything, it allowed `nn_close()` to continue past the semaphore capture and
> free the socket, causing the blocking I/O to fault trying to access freed
> memory. Yikes!
>
> Hope this helps -- and let's keep bouncing around ideas,
> Jack R. Dunaway | Wirebird Labs LLC
>
>
>
>
> On Sat, Jan 31, 2015 at 9:19 PM, Jason E. Aten <j.e.aten@xxxxxxxxx> wrote:
>>
>> Update: I got a stack trace from gdb. It appears to be hung in
>> nn_sem_wait(), at src/utils/sem.c:159, which is a call:
>>
>> rc = sem_wait (&self->sem); // src/utils/sem.c:159 hangs here.
>>
>>
>> So my earlier diagnosis was likely incorrect. It seems we have a logic bug
>> instead.
>>
>> (gdb) bt
>> #0  sem_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/sem_wait.S:86
>> #1  0x00007ffff7dd0eeb in nn_sem_wait (self=self@entry=0x7fffb4017a88) at src/utils/sem.c:159
>> #2  0x00007ffff7dca6c2 in nn_sock_term (self=0x7fffb40179b0) at src/core/sock.c:202
>> #3  0x00007ffff7dc7837 in nn_close (s=31) at src/core/global.c:574
>> #4  0x0000000000401d7b in _cgo_14c45440a8bc_C2func_nn_close (v=0xc2094182a0)
>>     at /home/jaten/go/src/github.com/glycerine/go-nanomsg/nanomsg.go:61
>> #5  0x0000000000489ca5 in asmcgocall () at /home/jaten/pkg/go1.4.1/go/src/runtime/asm_amd64.s:665
>> #6  0x0000000000000008 in ?? ()
>> #7  0x000000c20913e000 in ?? ()
>> #8  0x000000000044e749 in runtime.cgocall_errno (fn=0x0, arg=0x0, ~r2=4204019)
>>     at /home/jaten/pkg/go1.4.1/go/src/runtime/cgocall.go:117
>> #9  0x000000000047e804 in runtime.mstart () at /home/jaten/pkg/go1.4.1/go/src/runtime/proc.c:836
>> #10 0x00000000004025f3 in crosscall_amd64 () at /home/jaten/pkg/go1.4.1/go/src/runtime/cgo/gcc_amd64.S:35
>> #11 0x0000000000000003 in ?? ()
>> #12 0x0000000000000000 in ?? ()
>> (gdb)
>>
>>
>> On Sat, Jan 31, 2015 at 6:38 PM, Jason E. Aten <j.e.aten@xxxxxxxxx> wrote:
>>>
>>> In my application, everything works for a while, but eventually the
>>> server hangs forever in nn_close() on a nanomsg socket.
>>>
>>> I read in the close(2) man page:
>>>
>>>        When dealing with sockets, you have to be sure that there is no
>>>        recv(2) still blocking on it on another thread, otherwise it might
>>>        block forever, since no more messages will be sent via the socket.
>>>        Be sure to use shutdown(2) to shut down all parts the connection
>>>        before closing the socket.
>>>
>>>
>>> Moreover I see this example discussion [the answer by Joseph Quinsey] of
>>> how to properly close a socket:
>>>
>>>
>>> http://stackoverflow.com/questions/12730477/close-is-not-closing-socket-properly
>>>
>>> Mr. Quinsey suggests that there are three (3) steps needed to
>>> successfully close without hanging:
>>>
>>> a) getsockopt(fd, SOL_SOCKET, SO_ERROR, (char *)&err, &len); // to clear
>>> any error on the socket
>>>
>>> b) shutdown(fd, SHUT_RDWR); // to terminate reliable delivery
>>>
>>> c) close(fd); // finally
>>>
>>>
>>> I don't see nanomsg doing a) or b), so I tend to think this is a bug in
>>> the nn_close() implementation, and these two steps should be added.
>>>
>>> Thoughts?
>>>
>>>
>>> Thanks!
>>>
>>> - Jason
>>
>>
>>
>>
>> --
>>
>> Best regards,
>> Jason
>>
>> --
>> Jason E. Aten, Ph.D.
>> j.e.aten@xxxxxxxxx
>> 650-429-8602
>> linkedin: https://www.linkedin.com/pub/jason-e-aten-ph-d/18/313/45a
>
>
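
For reference, the three steps quoted above from the Stack Overflow
answer can be sketched as a small POSIX helper. This is only an
illustration for a plain BSD socket descriptor: the function name
close_socket_gracefully is made up here, and nothing in this thread shows
that nn_close() should be implemented this way (the later gdb trace points
at a semaphore wait inside the library instead).

```c
#include <errno.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper: run steps a), b), c) on an OS-level socket.
   Returns 0 if every step succeeded, -1 otherwise. */
static int close_socket_gracefully (int fd)
{
    int err = 0;
    socklen_t len = sizeof (err);
    int rc = 0;

    /* a) Fetch, and thereby clear, any pending error on the socket. */
    if (getsockopt (fd, SOL_SOCKET, SO_ERROR, &err, &len) == -1)
        rc = -1;

    /* b) Shut down both directions. ENOTCONN only means there was no
       connection left to shut down, so it is not treated as a failure. */
    if (shutdown (fd, SHUT_RDWR) == -1 && errno != ENOTCONN)
        rc = -1;

    /* c) Release the descriptor regardless of what happened above. */
    if (close (fd) == -1)
        rc = -1;

    return rc;
}
```

After the shutdown in step b), a peer blocked in read() or recv() on the
other end of the connection sees end-of-file rather than waiting
indefinitely, which is the behaviour the man page excerpt is concerned
with.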


