[nanomsg] Re: getaddrinfo_a() related memory leak and another issue was: Re: Re: What has changed since 0.2 in socket handling?

  • From: George Lambert <marchon@xxxxxxxxx>
  • To: nanomsg@xxxxxxxxxxxxx
  • Date: Wed, 21 Jan 2015 09:50:40 -0500

That looks to me like a very good catch.

George Lambert

On Wed, Jan 21, 2015 at 9:20 AM, Boszormenyi Zoltan <zboszor@xxxxx> wrote:

> Hi,
>
> when can I get a review on https://github.com/nanomsg/nanomsg/pull/356 ?
>
> I can't believe this leak only happens on Fedora 20 and 21;
> at least some other Linux distributions should show the same problem.
>
> Thanks in advance,
> Zoltán Böszörményi
>
> On 2014-12-21 07:20, Boszormenyi Zoltan wrote:
> > Hi,
> >
> > if you read the starting mail of this thread, you can see
> > a memory leak reported by Valgrind. Your reply was at
> >
> > //www.freelists.org/post/nanomsg/What-has-changed-since-02-in-socket-handling,1
> >
> > and you wondered about the nature of the leak, i.e. whether
> > it's in GLIBC or nanomsg.
> >
> > The number of memory blocks leaked equals the number of
> > getaddrinfo_a() calls, and the leak can be plugged simply by
> > calling freeaddrinfo() as in the attached patch.
> >
> > It became somewhat obvious after reading the example
> > in the getaddrinfo() man page that you need to call freeaddrinfo()
> > on the result. But it's not done in
> > src/transports/utils/dns_getaddrinfo_a.inc
> > at the moment, and the getaddrinfo_a() man page doesn't
> > explicitly say you need to freeaddrinfo(->ar_result); it only says
> > "The elements of this structure correspond to the arguments of getaddrinfo(3).
> >  ...
> >  Finally, ar_result corresponds to the res argument; you do not need to
> >  initialize this element, it will be automatically set when the request is resolved.
> >  ...
> > "
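For context, the ownership rule the man page implies can be shown with plain getaddrinfo(); this is only a minimal sketch (not the actual patch), but the same rule applies to the ar_result filled in by getaddrinfo_a() — each successful resolution hands the caller a heap-allocated addrinfo list that must be released, and Valgrind reports one leaked chain per missed free:

```c
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netdb.h>

int main(void)
{
    struct addrinfo hints;
    struct addrinfo *res = NULL;

    memset(&hints, 0, sizeof(hints));
    hints.ai_family = AF_UNSPEC;
    hints.ai_socktype = SOCK_STREAM;

    /* The port here is arbitrary; with getaddrinfo_a() the equivalent
     * result would arrive in the gaicb's ar_result field. */
    int rc = getaddrinfo("localhost", "5555", &hints, &res);
    if (rc != 0) {
        fprintf(stderr, "getaddrinfo: %s\n", gai_strerror(rc));
        return 1;
    }

    /* ... use res (or req->ar_result in the getaddrinfo_a() case) ... */

    /* Without this call, the block allocated inside gaih_inet() leaks,
     * which matches the Valgrind report earlier in this thread. */
    freeaddrinfo(res);
    printf("resolved and freed\n");
    return 0;
}
```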
> >
> > Yesterday, I tried disabling getaddrinfo_a() detection in configure.ac
> > to see whether it leaks the same way. To my surprise, I got an
> >
> > Assertion failed: reply && !reply->ai_next (src/transports/utils/dns_getaddrinfo.inc:112)
> >
> > when trying to nn_connect() to localhost. It turned out that GLIBC
> > returns the resolved 127.0.0.1 twice, both for getaddrinfo() and getaddrinfo_a().
> > I haven't looked at the differences between the two returned structures,
> > but there are indeed valid cases when more than one address
> > is returned, e.g.:
> >
> > $ host www.kernel.org
> > www.kernel.org is an alias for pub.all.kernel.org.
> > pub.all.kernel.org has address 149.20.4.69
> > pub.all.kernel.org has address 198.145.20.140
> > pub.all.kernel.org has address 199.204.44.194
> > pub.all.kernel.org has IPv6 address 2001:4f8:1:10:0:1991:8:25
> >
> > Considering this, the nn_assert() on line 112 in
> > src/transports/utils/dns_getaddrinfo.inc is misguided.
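A more tolerant approach than asserting a single entry is to walk the ai_next chain; the sketch below illustrates this with plain getaddrinfo() (hostname and port are arbitrary, and this is not the nanomsg code itself):

```c
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netdb.h>

int main(void)
{
    struct addrinfo hints, *res, *it;

    memset(&hints, 0, sizeof(hints));
    hints.ai_family = AF_UNSPEC;
    hints.ai_socktype = SOCK_STREAM;

    if (getaddrinfo("localhost", "80", &hints, &res) != 0) {
        fprintf(stderr, "resolution failed\n");
        return 1;
    }

    /* Walk the whole chain instead of assuming a single entry:
     * an assertion like !res->ai_next fires as soon as the resolver
     * returns both an IPv4 and an IPv6 entry, multiple A records
     * (as in the www.kernel.org example), or duplicates. */
    int count = 0;
    for (it = res; it != NULL; it = it->ai_next)
        count++;

    printf("got %d address entries\n", count);
    freeaddrinfo(res);
    return count >= 1 ? 0 : 1;
}
```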
> >
> > Best regards,
> > Zoltán Böszörményi
> >
> > On 2014-12-20 20:04, Boszormenyi Zoltan wrote:
> >> Hi again,
> >>
> >> On 2014-11-29 08:27, Boszormenyi Zoltan wrote:
> >>> Hi,
> >>>
> >>> sorry for not replying to your answer, but I only re-subscribed recently
> >>> and didn't receive the answer from the mailing list.
> >>>
> >>> I sent the test program in private that integrated networking
> >>> into a GLIB mainloop. The real code we use allows switching
> >>> between ZeroMQ 3 (3.2.4, to be exact) and nanomsg at
> >>> configure time and uses static inline wrappers and #define's
> >>> for this reason. We only use the REQ/REP pattern at the moment.
> >>>
> >>> The currently attached test programs (obvious ones, really)
> >>> do exhibit the same problem I described in the first mail on
> >>> Fedora 20 and 21. Messaging stops after a few (2 to 8) thousand
> >>> messages.
> >> The last commit, "Fix locking bug in nn_global_submit_statistics()",
> >> has fixed the lockup problem for REQ/REP.
> >>
> >> Thanks!
> >>
> >>> Similar code (the same wrapper API with GLIB mainloop integration)
> >>> that uses ZeroMQ didn't stop; I ran one test overnight,
> >>> and after about 72 million packets the program was still running
> >>> stably and without any leaks. Again, on ZeroMQ 3.2.4.
> >>>
> >>> Regarding the closed sockets in TIME_WAIT state, I noticed that
> >>> they slow down ZeroMQ, too, but don't make it lock up. Setting
> >>> these sysctl variables helps eliminate the slowdown by instructing
> >>> the kernel to reuse those sockets more aggressively:
> >>>
> >>> net.ipv4.tcp_tw_recycle = 1
> >>> net.ipv4.tcp_tw_reuse = 1
> >>>
> >>> Unfortunately, this didn't help nanomsg.
> >>>
> >>> Best regards,
> >>> Zoltán Böszörményi
> >>>
> >>> On 2014-11-21 21:46, Boszormenyi Zoltan wrote:
> >>>> Hi,
> >>>>
> >>>> I use nanomsg with a wrapper library that integrates the networking
> >>>> request-response pattern into the GLIB mainloop via
> >>>> nn_getsockopt(NN_SOL_SOCKET, NN_RCVFD).
> >>>>
> >>>> IIRC, it worked well and without any leaks back then with nanomsg 0.2-ish.
> >>>>
> >>>> Now, I have upgraded to 0.5 and e.g. on Fedora 20 and 21, my example
> >>>> programs lock up after some time. netstat shows there are many sockets
> >>>> in TIME_WAIT state even after both the client and server programs have quit.
> >>>>
> >>>> Also, this memory leak was observed on both Fedora 20 and 21:
> >>>>
> >>>> ==18504== 43,776 (21,888 direct, 21,888 indirect) bytes in 342 blocks are definitely lost
> >>>> in loss record 3,232 of 3,232
> >>>> ==18504==    at 0x4A0645D: malloc (in /usr/lib64/valgrind/vgpreload_memcheck-amd64-linux.so)
> >>>> ==18504==    by 0x3E902DA99C: gaih_inet (in /usr/lib64/libc-2.18.so)
> >>>> ==18504==    by 0x3E902DE38C: getaddrinfo (in /usr/lib64/libc-2.18.so)
> >>>> ==18504==    by 0x5085FEF: handle_requests (in /usr/lib64/libanl-2.18.so)
> >>>> ==18504==    by 0x3E90E07EE4: start_thread (in /usr/lib64/libpthread-2.18.so)
> >>>> ==18504==    by 0x3E902F4B8C: clone (in /usr/lib64/libc-2.18.so)
> >>>>
> >>>> My understanding with nanomsg 0.2 was that I need these with REQ/REP:
> >>>>
> >>>> server:
> >>>> initialization: nn_socket, nn_bind
> >>>> in the handler loop: nn_recv[msg] + nn_freemsg on the incoming message,
> >>>> then nn_send[msg] to the client
> >>>> when quitting: nn_close
> >>>>
> >>>> client (per REQ/REP message exchange):
> >>>> nn_socket, nn_connect, nn_send[msg], nn_recv[msg], nn_close
> >>>>
> >>>> Do I need to nn_close() the socket on the server side or do anything else
> >>>> after the reply was sent?
> >>>>
> >>>> Thanks in advance,
> >>>> Zoltán Böszörményi
> >>>>
> >>
>
>
>

