Hi,

when can I get a review on https://github.com/nanomsg/nanomsg/pull/356 ?
I can't believe this leak only happens on Fedora 20 and 21. At least some
other Linuxes should show the same problem.

Thanks in advance,
Zoltán Böszörményi

On 2014-12-21 07:20, Boszormenyi Zoltan wrote:
> Hi,
>
> if you read the starting mail of this thread, you can see
> a memory leak reported by Valgrind. Your reply was at
>
> //www.freelists.org/post/nanomsg/What-has-changed-since-02-in-socket-handling,1
>
> and wondered about the nature of the leak, i.e. whether
> it's in GLIBC or nanomsg.
>
> The number of memory blocks leaked equals the number of
> getaddrinfo_a() calls, and the leak can simply be plugged by calling
> freeaddrinfo() as in the attached patch.
>
> It became somewhat obvious after reading the example
> in the getaddrinfo() man page that you need to call freeaddrinfo()
> on the result. But it's not done in src/transports/utils/dns_getaddrinfo_a.inc
> at the moment, and the getaddrinfo_a() man page doesn't
> explicitly say you need to call freeaddrinfo(->ar_result); it only says:
>
> "The elements of this structure correspond to the arguments of getaddrinfo(3).
> ...
> Finally, ar_result corresponds to the res argument; you do not need to
> initialize this element, it will be automatically set when the request
> is resolved.
> ..."
>
> Yesterday I tried disabling getaddrinfo_a() detection in configure.ac
> to see whether it leaks the same way. To my surprise, I got an
>
> Assertion failed: reply && !reply->ai_next
> (src/transports/utils/dns_getaddrinfo.inc:112)
>
> when trying to nn_connect() to localhost. It turned out that GLIBC
> returns the resolved 127.0.0.1 twice, both for getaddrinfo() and
> getaddrinfo_a(). I haven't looked at the differences between the two
> returned structures, but there are indeed valid cases when more than
> one address is returned, e.g.:
>
> $ host www.kernel.org
> www.kernel.org is an alias for pub.all.kernel.org.
> pub.all.kernel.org has address 149.20.4.69
> pub.all.kernel.org has address 198.145.20.140
> pub.all.kernel.org has address 199.204.44.194
> pub.all.kernel.org has IPv6 address 2001:4f8:1:10:0:1991:8:25
>
> Considering this, the nn_assert() on line 112 in
> src/transports/utils/dns_getaddrinfo.inc is misguided.
>
> Best regards,
> Zoltán Böszörményi
>
> On 2014-12-20 20:04, Boszormenyi Zoltan wrote:
>> Hi again,
>>
>> On 2014-11-29 08:27, Boszormenyi Zoltan wrote:
>>> Hi,
>>>
>>> sorry for not replying to your answer, but I only re-subscribed
>>> recently and I didn't receive the answer from the mailing list.
>>>
>>> I sent the test program in private that integrated networking
>>> into a GLIB mainloop. The real code we use allows switching
>>> between ZeroMQ 3 (3.2.4, to be exact) and nanomsg at
>>> configure time and uses static inline wrappers and #defines
>>> for this reason. We only use the REQ/REP pattern at the moment.
>>>
>>> The currently attached test programs (obvious ones, really)
>>> exhibit the same problem I described in the first mail on
>>> Fedora 20 and 21. Messaging stops after a few (2 to 8) thousand
>>> messages.
>>
>> The last commit, "Fix locking bug in nn_global_submit_statistics()",
>> has fixed the lockup problem for REQ/REP.
>>
>> Thanks!
>>
>>> Similar code (or the wrapper API with GLIB mainloop integration)
>>> that uses ZeroMQ didn't stop; I ran one test overnight and after
>>> about 72 million packets, the program still ran stable and without
>>> any leaks. Again, on ZeroMQ 3.2.4.
>>>
>>> Regarding the closed sockets in TIME_WAIT state, I noticed that
>>> they slow down ZeroMQ too, but don't make it lock up. Setting
>>> these sysctl variables helps eliminate the slowdown by instructing
>>> the kernel to reuse those sockets more aggressively:
>>>
>>> net.ipv4.tcp_tw_recycle = 1
>>> net.ipv4.tcp_tw_reuse = 1
>>>
>>> Unfortunately, this didn't help nanomsg.
>>>
>>> Best regards,
>>> Zoltán Böszörményi
>>>
>>> On 2014-11-21 21:46, Boszormenyi Zoltan wrote:
>>>> Hi,
>>>>
>>>> I use nanomsg with a wrapper library that integrates the networking
>>>> request-response pattern into the GLIB mainloop via
>>>> nn_getsockopt(NN_SOL_SOCKET, NN_RCVFD).
>>>>
>>>> IIRC, it worked well and without any leaks back then with nanomsg 0.2-ish.
>>>>
>>>> Now I have upgraded to 0.5 and, e.g. on Fedora 20 and 21, my example
>>>> programs lock up after some time. netstat shows there are many sockets
>>>> in TIME_WAIT state even after both the client and server programs have
>>>> quit.
>>>>
>>>> Also, this memory leak was observed on both Fedora 20 and 21:
>>>>
>>>> ==18504== 43,776 (21,888 direct, 21,888 indirect) bytes in 342 blocks
>>>> are definitely lost in loss record 3,232 of 3,232
>>>> ==18504==    at 0x4A0645D: malloc (in /usr/lib64/valgrind/vgpreload_memcheck-amd64-linux.so)
>>>> ==18504==    by 0x3E902DA99C: gaih_inet (in /usr/lib64/libc-2.18.so)
>>>> ==18504==    by 0x3E902DE38C: getaddrinfo (in /usr/lib64/libc-2.18.so)
>>>> ==18504==    by 0x5085FEF: handle_requests (in /usr/lib64/libanl-2.18.so)
>>>> ==18504==    by 0x3E90E07EE4: start_thread (in /usr/lib64/libpthread-2.18.so)
>>>> ==18504==    by 0x3E902F4B8C: clone (in /usr/lib64/libc-2.18.so)
>>>>
>>>> My understanding with nanomsg 0.2 was that I need these with REQ/REP:
>>>>
>>>> server:
>>>>   initialization: nn_socket, nn_bind
>>>>   in the handler loop: nn_recv[msg] + nn_freemsg on the incoming
>>>>   message, then nn_send[msg] to the client
>>>>   when quitting: nn_close
>>>>
>>>> client (per REQ/REP message exchange):
>>>>   nn_socket, nn_connect, nn_send[msg], nn_recv[msg], nn_close
>>>>
>>>> Do I need to nn_close() the socket on the server side or anything else
>>>> after the reply was sent?
>>>>
>>>> Thanks in advance,
>>>> Zoltán Böszörményi