[nanomsg] Re: [PATCH] Terminate astream if initializing usock failed

  • From: Nir Soffer <nirsof@xxxxxxxxx>
  • To: nanomsg@xxxxxxxxxxxxx
  • Date: Tue, 12 Mar 2013 15:22:11 +0200

On Tue, Mar 12, 2013 at 3:20 PM, Nir Soffer <nirsof@xxxxxxxxx> wrote:

>
>
> On Tue, Mar 12, 2013 at 1:09 PM, Martin Sustrik <sustrik@xxxxxxxxxx>wrote:
>
>> On 2013-03-12 11:49, Nir Soffer wrote:
>>
>>> On Tue, Mar 12, 2013 at 12:09 PM, Martin Sustrik <sustrik@xxxxxxxxxx>
>>> wrote:
>>>
>>>  On 2013-03-12 10:59, Nir Soffer wrote:
>>>>
>>>>  On OS X the scenario is this:
>>>>
>>>>>
>>>>> 1. client connects and close the socket
>>>>> 2. server accept a new socket
>>>>> 3. server tune the socket
>>>>> 4. setsockopt fail with EINVAL (looks like a bug in OS X, as the value
>>>>> is
>>>>> valid (e.g. 128K receive buffer)
>>>>>
>>>>>
>>>> Yuck. That's an ugly behaviour. Anyway, what if we simply ignored the
>>>> EINVAL error from all those setsockopts, when on OSX?
>>>>
>>>
>>>
>>> What will happen if we fail to make the socket non-blocking, and then we
>>> try to initialize the stream with a blocking socket?
>>>
>>> I think that failing fast is better in most cases. What I don't like
>>> about
>>> this patch is that we cannot detect a bug in the library. For example, if
>>> we change the code to set a (real) invalid value, we will simply fail to
>>> handle all connections, instead of asserting.
>>>
>>
>> Yeah. I hate that we can't distinguish real "errorneous value" problem
>> (nanomsg should assert) from "peer disconnected" problem (nanomsg should
>> close the socket in decent manner).
>>
>> Anyway, given that we are not able to distinguish between the two, I
>> think that only option is to assume that there's no bug in the related
>> library code and consider the error to mean "peer disconnected".
>>
>>
>>  What if we check EINVAL, then fail with -EINVAL on OS X (and other
>>> platform
>>> that show this error), and assert on Linux?
>>>
>>
>> AFAICS it's the same as ignoring the error on OSX. The assumption is that
>> the peer have disconnected. If we ignore the error next send/recv call on
>> socket fails and socket will be shut down. If we return -EINVAL, the socket
>> will be shut down. Same thing.
>>
>>
>>  I've investigated a little bit further and the problem is in stream.c,
>>>> lines 127-130. There's a send operation launched. If sending fails, the
>>>> callback deallocates the stream object. The execution then continues to
>>>> line 130 where recv() is done on invalid (recently deallocated) socket.
>>>>
>>>
>>>
>>> This explain the impossible crash :-)
>>>
>>> Synchronous failures through callbacks are really horrible to debug - I
>>> was
>>> bitten by such issues many times. It would be much easier if all errors
>>> that invoke a callback are asynchrounous. For example, if the send
>>> operation fails, we set a flag and wake up the event loop, so the
>>> callback
>>> on the next iteration of the event loop.
>>>
>>
>> Yes. Same experience with callbacks here. (ZeroMQ is a callback hell
>> internally.) I would like to do it in a different way for nanomsg, however,
>> it's not a trivial patch, so I've been delaying dealing with this issue
>> till I have some uninterrupted time to hack on it.
>>
>> Few thoughts on the mechanism: When AIO subsystem wants to send a
>> notification it should check whether it is currently in a worker thread or
>> in user's thread (there's a function for determining that already). In the
>> former case it should just invoke the callback. In the latter case it
>> should send an nn_event (already implemented) to the worker thread. The
>> worker thread, on reception of the event should call the callback.
>>
>> How does that sound?
>
>
> The problem is the the callback which calls you back when you do not
> expect it:
>
> obj_do_this (object)
>     deeper (object)
>         even_deeper (object)
>             // something failed here
>             obj_error_callback (object)
>                 obj_term (object)
>                     free (object)
> obj_do_that (object) -- crash - now if you are lucky, or later if not.
>
> Lets assume that obj_do_this() should invoke the error callback, since it
> is called from some other event, so we cannot return an error code.
>
> We can do is:
>
> obj_do_this (object)
>     deeper (object)
>         even_deeper (object)
>             // something failed here
>             mark object as invalid
>             send event that will run obj_error_callback later
> obj_do_that (object)
>     return error if object is invalid
>
> In the next event loop iteration:
>
> obj_error_callback (object)
>     terminate and free object
>
> A simpler solution would be to always return an error code, so if a
> function failed and terminated the object, the next function can detect it.
>
>
The problem can happen on both the user thread of the event loop thread, so
it does not help to check on which thread you are running.

Other related posts: