On Tue, Mar 12, 2013 at 3:20 PM, Nir Soffer <nirsof@xxxxxxxxx> wrote: > > > On Tue, Mar 12, 2013 at 1:09 PM, Martin Sustrik <sustrik@xxxxxxxxxx>wrote: > >> On 2013-03-12 11:49, Nir Soffer wrote: >> >>> On Tue, Mar 12, 2013 at 12:09 PM, Martin Sustrik <sustrik@xxxxxxxxxx> >>> wrote: >>> >>> On 2013-03-12 10:59, Nir Soffer wrote: >>>> >>>> On OS X the scenario is this: >>>> >>>>> >>>>> 1. client connects and close the socket >>>>> 2. server accept a new socket >>>>> 3. server tune the socket >>>>> 4. setsockopt fail with EINVAL (looks like a bug in OS X, as the value >>>>> is >>>>> valid (e.g. 128K receive buffer) >>>>> >>>>> >>>> Yuck. That's an ugly behaviour. Anyway, what if we simply ignored the >>>> EINVAL error from all those setsockopts, when on OSX? >>>> >>> >>> >>> What will happen if we fail to make the socket non-blocking, and then we >>> try to initialize the stream with a blocking socket? >>> >>> I think that failing fast is better in most cases. What I don't like >>> about >>> this patch is that we cannot detect a bug in the library. For example, if >>> we change the code to set a (real) invalid value, we will simply fail to >>> handle all connections, instead of asserting. >>> >> >> Yeah. I hate that we can't distinguish real "errorneous value" problem >> (nanomsg should assert) from "peer disconnected" problem (nanomsg should >> close the socket in decent manner). >> >> Anyway, given that we are not able to distinguish between the two, I >> think that only option is to assume that there's no bug in the related >> library code and consider the error to mean "peer disconnected". >> >> >> What if we check EINVAL, then fail with -EINVAL on OS X (and other >>> platform >>> that show this error), and assert on Linux? >>> >> >> AFAICS it's the same as ignoring the error on OSX. The assumption is that >> the peer have disconnected. If we ignore the error next send/recv call on >> socket fails and socket will be shut down. If we return -EINVAL, the socket >> will be shut down. Same thing. >> >> >> I've investigated a little bit further and the problem is in stream.c, >>>> lines 127-130. There's a send operation launched. If sending fails, the >>>> callback deallocates the stream object. The execution then continues to >>>> line 130 where recv() is done on invalid (recently deallocated) socket. >>>> >>> >>> >>> This explain the impossible crash :-) >>> >>> Synchronous failures through callbacks are really horrible to debug - I >>> was >>> bitten by such issues many times. It would be much easier if all errors >>> that invoke a callback are asynchrounous. For example, if the send >>> operation fails, we set a flag and wake up the event loop, so the >>> callback >>> on the next iteration of the event loop. >>> >> >> Yes. Same experience with callbacks here. (ZeroMQ is a callback hell >> internally.) I would like to do it in a different way for nanomsg, however, >> it's not a trivial patch, so I've been delaying dealing with this issue >> till I have some uninterrupted time to hack on it. >> >> Few thoughts on the mechanism: When AIO subsystem wants to send a >> notification it should check whether it is currently in a worker thread or >> in user's thread (there's a function for determining that already). In the >> former case it should just invoke the callback. In the latter case it >> should send an nn_event (already implemented) to the worker thread. The >> worker thread, on reception of the event should call the callback. >> >> How does that sound? > > > The problem is the the callback which calls you back when you do not > expect it: > > obj_do_this (object) > deeper (object) > even_deeper (object) > // something failed here > obj_error_callback (object) > obj_term (object) > free (object) > obj_do_that (object) -- crash - now if you are lucky, or later if not. > > Lets assume that obj_do_this() should invoke the error callback, since it > is called from some other event, so we cannot return an error code. > > We can do is: > > obj_do_this (object) > deeper (object) > even_deeper (object) > // something failed here > mark object as invalid > send event that will run obj_error_callback later > obj_do_that (object) > return error if object is invalid > > In the next event loop iteration: > > obj_error_callback (object) > terminate and free object > > A simpler solution would be to always return an error code, so if a > function failed and terminated the object, the next function can detect it. > > The problem can happen on both the user thread of the event loop thread, so it does not help to check on which thread you are running.