Generally, yes: protection-domain-crossing memcpys are where it is cheaper and easier to copy. Within the same address space, using ownership / reference passing is easier. In fact, mangos does just this to avoid pointless data copying.

But more than that, I found that what *really* helped was optimizing to eliminate pressure on the memory allocator and garbage collector (in this case, maintaining my own cache of message objects). Having a version of the send and receive routines that could be passed a reference, complete with a deallocation function / destructor, would not be a bad way to eliminate the data copies.

Again, this optimization can be performed relatively painlessly later; I would first do some profiling to demonstrate, based on measurement, that the change was necessary before adding the complexity up front, though.

	- Garrett

> On Oct 29, 2014, at 3:52 AM, Alex Elsayed <eternaleye@xxxxxxxxx> wrote:
> 
> Matthew Hall wrote:
> 
>> On Tue, Oct 28, 2014 at 05:57:10PM -0700, Garrett D'Amore wrote:
>>> Bluntly, I think you may be suffering from premature optimization.
>>> 
>>> Getting to tens of gigabits per second isn't that hard on modern hardware.
>>> 
>>> Profile your app and check to see where it is spending time.
>>> 
>>> It may be cheaper to throw a little more hardware at the problem and
>>> parallelize than to try extraordinary measures like a user-space TCP
>>> stack.
>>> 
>>> Sent from my iPhone
>> 
>> I didn't perform any of the optimizations yet. I was just showing a
>> practical example of the kind of issues I can run into using these
>> different hunks of code together.
>> 
>> I can tell you that on a previous project similar to this one, where all
>> the data was getting memcpy'ed between one half of the TCP/IP stack and
>> the other in a similar environment, removing the unneeded memcpy's gave a
>> 50% boost.
>> 
>> But that environment also memcpy'd a higher percentage of the traffic than
>> this one (necessarily) would.
>> 
>> Regarding user-space TCP/IP, I can tell you from past experience there was
>> no way to get close to the top level of performance I eventually want to
>> have without it.
> 
> Linux has recently seen a number of improvements that drastically reduce the
> overhead of networking; it may be worth looking up the LWN article about
> 'xmit_more' - one of the tests was, in fact, generating wire-rate traffic on
> 10gig.
> 
> Also, it seems that what you mean by 'zero-copy' is not _quite_ the same as
> what's more commonly meant. Usually, zero-copy refers to not making copies
> when _crossing some sort of protection domain_: address space (shmem for
> IPC), network (RDMA), etc.
> 
> Here, you're referring to crossing _layers_ inside your own address space,
> which doesn't really have a term I've seen used consistently, but it often
> shows up under the banner of ownership-based handling of data: rather than
> giving an API _access_ to a piece of data, you give it _ownership_ of that
> data.
> 
> The Rust language is basically designed around that idea.
> 
> The points people have been bringing up against it - zero-copy not being
> worth it under 512K, etc. - are mostly about the crossing-protection-domains
> kind.
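To make the ownership-based approach concrete, here is a minimal Go sketch combining the two ideas above: a cache of message objects (so steady-state traffic puts no pressure on the allocator or GC) and send/receive routines that transfer ownership of a reference rather than copying the body. The `Message`, `NewMessage`, `Free`, and `send` names are hypothetical illustrations, not the actual mangos API.

```go
package main

import (
	"fmt"
	"sync"
)

// Message is a hypothetical message type; mangos has its own,
// differently shaped message structure. This is only a sketch.
type Message struct {
	Body []byte
}

// pool caches Message objects so that, at steady state, sending
// allocates nothing and the GC has nothing new to trace.
var pool = sync.Pool{
	New: func() interface{} { return &Message{Body: make([]byte, 0, 4096)} },
}

// NewMessage hands the caller an owned message from the cache.
func NewMessage() *Message {
	m := pool.Get().(*Message)
	m.Body = m.Body[:0] // reset length, keep capacity
	return m
}

// Free acts as the deallocation function / destructor: it returns
// ownership of the message to the cache. The caller must not touch
// m afterwards.
func (m *Message) Free() {
	pool.Put(m)
}

// send transfers ownership of m down the channel. No copy of the
// body is made; the receiver becomes responsible for calling Free.
func send(ch chan<- *Message, m *Message) {
	ch <- m
}

func main() {
	ch := make(chan *Message, 1)

	m := NewMessage()
	m.Body = append(m.Body, []byte("hello")...)
	send(ch, m) // ownership moves here; we must not use m again

	got := <-ch
	fmt.Println(string(got.Body))
	got.Free() // the receiver, now the owner, recycles the object
}
```

As Garrett notes, this kind of change is best made after profiling shows the copies or allocations actually matter; the ownership discipline (nobody touches a message after passing it on or freeing it) is a real complexity cost.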