[nanomsg] Re: getaddrinfo_a() related memory leak and another issue was: Re: Re: What has changed since 0.2 in socket handling?

  • From: Boszormenyi Zoltan <zboszor@xxxxx>
  • To: nanomsg@xxxxxxxxxxxxx
  • Date: Wed, 21 Jan 2015 15:20:19 +0100

Hi,

when can I get a review on https://github.com/nanomsg/nanomsg/pull/356?

I can't believe this leak happens only on Fedora 20 and 21.
At least some other Linux distributions should show the same problem.

Thanks in advance,
Zoltán Böszörményi

On 2014-12-21 07:20, Boszormenyi Zoltan wrote:
> Hi,
>
> if you read the starting mail of this thread, you can see
> a memory leak reported by Valgrind. Your reply was at
>
> //www.freelists.org/post/nanomsg/What-has-changed-since-02-in-socket-handling,1
>
> and wondered about the nature of the leak, i.e. whether
> it's in GLIBC or nanomsg.
>
> The number of leaked memory blocks equals the number of
> getaddrinfo_a() calls, and the leak can simply be plugged by calling
> freeaddrinfo() as in the attached patch.
>
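> For illustration, here is a minimal standalone sketch of the pattern
> (not the actual patch, which is in the pull request; link with -lanl):
>
>     #define _GNU_SOURCE
>     #include <netdb.h>
>     #include <string.h>
>
>     int main(void)
>     {
>         struct gaicb req;
>         struct gaicb *list[1] = { &req };
>
>         memset(&req, 0, sizeof(req));
>         req.ar_name = "localhost";
>
>         /* GAI_WAIT for brevity; nanomsg resolves asynchronously. */
>         if (getaddrinfo_a(GAI_WAIT, list, 1, NULL) == 0 &&
>             gai_error(&req) == 0) {
>             /* ... use req.ar_result ... */
>             freeaddrinfo(req.ar_result);  /* the call that plugs the leak */
>         }
>         return 0;
>     }
>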
> It became fairly obvious after reading the example
> in the getaddrinfo() man page that you need to call freeaddrinfo()
> on the result. But this is not done in src/transports/utils/dns_getaddrinfo_a.inc
> at the moment, and the getaddrinfo_a() man page doesn't
> explicitly say you need to call freeaddrinfo() on ar_result; it only says:
> "The elements of this structure correspond to the arguments of getaddrinfo(3).
>  ...
>  Finally, ar_result corresponds to the res argument; you do not need to 
> initialize this ele‐
>  ment, it will be automatically set when the request is resolved.
>  ...
> "
>
> Yesterday, I tried disabling getaddrinfo_a() detection in configure.ac
> to see whether it leaks the same way. To my surprise, I got an
>
> Assertion failed: reply && !reply->ai_next (src/transports/utils/dns_getaddrinfo.inc:112)
>
> when trying to nn_connect() to localhost. It turned out that GLIBC
> returns the resolved 127.0.0.1 twice, both for getaddrinfo and getaddrinfo_a.
> I haven't looked at the differences between the two returned structures,
> but there are indeed valid cases where more than one address
> is returned, e.g.:
>
> $ host www.kernel.org
> www.kernel.org is an alias for pub.all.kernel.org.
> pub.all.kernel.org has address 149.20.4.69
> pub.all.kernel.org has address 198.145.20.140
> pub.all.kernel.org has address 199.204.44.194
> pub.all.kernel.org has IPv6 address 2001:4f8:1:10:0:1991:8:25
>
> Considering this, the nn_assert() on line 112 in
> src/transports/utils/dns_getaddrinfo.inc is misguided.
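>
> Instead of asserting a single entry, the result should be treated as
> the linked list it is. A sketch of the idea (the variable name comes
> from the assertion; the loop is mine, not the nanomsg code):
>
>     /* 'reply' is the struct addrinfo * returned by getaddrinfo();
>        walk the ai_next chain and pick the first usable address
>        instead of asserting the chain has exactly one entry. */
>     struct addrinfo *it;
>     for (it = reply; it != NULL; it = it->ai_next) {
>         if (it->ai_family == AF_INET || it->ai_family == AF_INET6) {
>             /* use it->ai_addr / it->ai_addrlen */
>             break;
>         }
>     }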
>
> Best regards,
> Zoltán Böszörményi
>
> On 2014-12-20 20:04, Boszormenyi Zoltan wrote:
>> Hi again,
>>
>> On 2014-11-29 08:27, Boszormenyi Zoltan wrote:
>>> Hi,
>>>
>>> sorry for not replying to your answer, but I only re-subscribed recently
>>> and didn't receive the answer from the mailing list.
>>>
>>> I sent the test program in private that integrates networking
>>> into a GLIB mainloop. The real code we use allows switching
>>> between ZeroMQ 3 (3.2.4, to be exact) and nanomsg at
>>> configure time and uses static inline wrappers and #define's
>>> for this reason. We only use the REQ/REP pattern at the moment.
>>>
>>> The currently attached test programs (obvious ones, really)
>>> do exhibit the same problem I described in the first mail on
>>> Fedora 20 and 21. Messaging stops after a few (2 to 8) thousand
>>> messages.
>> The last commit, "Fix locking bug in nn_global_submit_statistics()",
>> has fixed the lockup problem for REQ/REP.
>>
>> Thanks!
>>
>>> Similar code (the wrapper API with GLIB mainloop integration)
>>> that uses ZeroMQ didn't stop; I ran one test overnight, and after
>>> about 72 million packets the program was still running stably
>>> and without any leaks. Again, on ZeroMQ 3.2.4.
>>>
>>> Regarding the closed sockets in TIME_WAIT state, I noticed that
>>> they slow down ZeroMQ, too, but don't make it lock up. Setting
>>> these sysctl variables helps eliminate the slowdown by instructing
>>> the kernel to reuse those sockets more aggressively:
>>>
>>> net.ipv4.tcp_tw_recycle = 1
>>> net.ipv4.tcp_tw_reuse = 1
>>>
>>> Unfortunately, this didn't help nanomsg.
>>>
>>> Best regards,
>>> Zoltán Böszörményi
>>>
>>> On 2014-11-21 21:46, Boszormenyi Zoltan wrote:
>>>> Hi,
>>>>
>>>> I use nanomsg with a wrapper library that integrates the networking
>>>> request-response pattern into the GLIB mainloop via
>>>> nn_getsockopt(NN_SOL_SOCKET, NN_RCVFD).
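>>>>
>>>> Roughly like this (a sketch of the kind of integration I mean,
>>>> not our actual wrapper code):
>>>>
>>>>     #include <nanomsg/nn.h>
>>>>     #include <glib.h>
>>>>
>>>>     static gboolean on_readable(GIOChannel *ch, GIOCondition cond,
>>>>                                 gpointer data)
>>>>     {
>>>>         int s = GPOINTER_TO_INT(data);
>>>>         void *msg = NULL;
>>>>         /* NN_RCVFD signalled readability, so this should not block */
>>>>         int n = nn_recv(s, &msg, NN_MSG, NN_DONTWAIT);
>>>>         if (n >= 0)
>>>>             nn_freemsg(msg);
>>>>         return TRUE;  /* keep watching the fd */
>>>>     }
>>>>
>>>>     void watch_nn_socket(int s)
>>>>     {
>>>>         int fd;
>>>>         size_t sz = sizeof(fd);
>>>>         nn_getsockopt(s, NN_SOL_SOCKET, NN_RCVFD, &fd, &sz);
>>>>         g_io_add_watch(g_io_channel_unix_new(fd), G_IO_IN,
>>>>                        on_readable, GINT_TO_POINTER(s));
>>>>     }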
>>>>
>>>> IIRC, it worked well and without any leaks back then with nanomsg 0.2-ish.
>>>>
>>>> Now I have upgraded to 0.5 and, e.g. on Fedora 20 and 21, my example
>>>> programs lock up after some time. netstat shows many sockets in
>>>> TIME_WAIT state even after both the client and server programs have quit.
>>>>
>>>> Also, this memory leak was observed on both Fedora 20 and 21:
>>>>
>>>> ==18504== 43,776 (21,888 direct, 21,888 indirect) bytes in 342 blocks are definitely lost in loss record 3,232 of 3,232
>>>> ==18504==    at 0x4A0645D: malloc (in /usr/lib64/valgrind/vgpreload_memcheck-amd64-linux.so)
>>>> ==18504==    by 0x3E902DA99C: gaih_inet (in /usr/lib64/libc-2.18.so)
>>>> ==18504==    by 0x3E902DE38C: getaddrinfo (in /usr/lib64/libc-2.18.so)
>>>> ==18504==    by 0x5085FEF: handle_requests (in /usr/lib64/libanl-2.18.so)
>>>> ==18504==    by 0x3E90E07EE4: start_thread (in /usr/lib64/libpthread-2.18.so)
>>>> ==18504==    by 0x3E902F4B8C: clone (in /usr/lib64/libc-2.18.so)
>>>>
>>>> My understanding with nanomsg 0.2 was that I need these with REQ/REP:
>>>>
>>>> server:
>>>> initialization: nn_socket, nn_bind
>>>> in the handler loop: nn_recv[msg] + nn_freemsg on the incoming message, then nn_send[msg] to the client
>>>> when quitting: nn_close
>>>>
>>>> client (per REQ/REP message exchange):
>>>> nn_socket, nn_connect, nn_send[msg], nn_recv[msg], nn_close
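>>>>
>>>> In code, the sequence looks roughly like this (a sketch, error
>>>> handling omitted; the tcp address is just a placeholder):
>>>>
>>>>     #include <nanomsg/nn.h>
>>>>     #include <nanomsg/reqrep.h>
>>>>
>>>>     /* server side, as described above */
>>>>     void serve(void)
>>>>     {
>>>>         int s = nn_socket(AF_SP, NN_REP);
>>>>         nn_bind(s, "tcp://127.0.0.1:5555");
>>>>         for (;;) {
>>>>             void *req = NULL;
>>>>             if (nn_recv(s, &req, NN_MSG, 0) < 0)
>>>>                 break;
>>>>             nn_freemsg(req);            /* free the incoming message */
>>>>             nn_send(s, "ok", 2, 0);     /* reply to the client */
>>>>         }
>>>>         nn_close(s);
>>>>     }
>>>>
>>>>     /* client side, one REQ/REP exchange per socket */
>>>>     void request_once(void)
>>>>     {
>>>>         int c = nn_socket(AF_SP, NN_REQ);
>>>>         nn_connect(c, "tcp://127.0.0.1:5555");
>>>>         nn_send(c, "hi", 2, 0);
>>>>         void *rep = NULL;
>>>>         if (nn_recv(c, &rep, NN_MSG, 0) >= 0)
>>>>             nn_freemsg(rep);
>>>>         nn_close(c);
>>>>     }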
>>>>
>>>> Do I need to nn_close() the socket on the server side, or do anything
>>>> else, after the reply is sent?
>>>>
>>>> Thanks in advance,
>>>> Zoltán Böszörményi
>>>>
>>

