[nanomsg] Re: poll: new allocator for consecutive chunks

  • From: Ioannis Charalampidis <ioannis.charalampidis@xxxxxxx>
  • To: <nanomsg@xxxxxxxxxxxxx>
  • Date: Thu, 14 Apr 2016 17:47:19 +0200

Hello Garrett, and thank you for your thorough reply :)

> I think the latter is more likely to be useful and efficient. If you have very specific memory allocation needs, then I think the API consumer should provide memory regions. In the Solaris/illumos kernel we have a technique like this for mblks called “esballoc” — http://www.unix.com/man-page/opensolaris/9f/esballoc/ — all the high performance networking NIC cards use this.
Gotcha. And as a matter of fact, I don't intend to expose this API to consumers (yet?). My intention is only to use it in the transport, in order to allocate multiple receive buffers efficiently (they will eventually be handed over to the user).

> I’m a little concerned here about how the final API is going to look for API consumers. I think it’s really important to have a clean API for users of the libnanomsg API, and as a design constraint, use of optimization features needs to be optional. That is, if an API consumer doesn’t use your fancy allocation scheme, then it should still work, perhaps less efficiently.
My only addition for API consumers is the ability to send a raw pointer in a zero-copy manner (the nn_allocmsg_ptr function). Take for example this use case (an actual use case here at CERN):

// Let's say you have a pointer that you haven't allocated yourself,
// for example a shared memory region from an FPGA DAQ card
void * ptr;
fpga_get_io_buffer( &ptr );

// This buffer is huge (50 MB+), so you want to transfer it in a zero-copy
// manner; to do that you create a chunk from its pointer
void * msg = nn_allocmsg_ptr( ptr, 52428800, &no_free, NULL );

// Now send it
nn_send( socket, &msg, NN_MSG, 0 );
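
(For reference, no_free above is just a user-supplied destructor that does nothing, because the buffer belongs to the FPGA driver. A minimal sketch, assuming the callback in my branch receives the buffer pointer plus the user pointer:)

// Do-nothing destructor: the FPGA driver owns this memory, so there is
// nothing to release when nanomsg is done with the chunk.
static void no_free( void * ptr, void * userptr )
{
    (void) ptr;
    (void) userptr;
}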

Note that nn_allocmsg_ptr works with other transports too, such as TCP (as a matter of fact I have added a test for TCP and it passes without any problem; have a look here: https://github.com/wavesoft/nanomsg/blob/pull-nn_allocmsg_ptr/tests/chunks.c#L198).

Also, this is fully backwards compatible with 'standard' chunks. This is possible because all transport implementations use chunkrefs (since that's the core of the nn_msg structure) rather than chunks, and therefore I only had to add this: https://github.com/wavesoft/nanomsg/commit/34160b18f833b2fa6dcf9f7cb068abaa8ceb83f8#diff-e07b6f5e4540ba1ca91e5857d87dc963R116 - nn_chunk_deref does all the magic.

> Further, there are challenges I think because you have to arrange for the API consumer and the transport to collaborate on the memory allocation scheme. What I mean is, the transport needs to be able to identify that the memory it wants to use is already “registered” (I guess this means that it is mapped for DMA in the system, and registers have been programmed on the device to identify its location by some kind of numeric ID?), so that it can do the right thing. I actually have no idea how you’re going to wind up doing that cleanly — most of my ideas for this wind up looking pretty ugly. A big part of the problem is that we don’t really reveal the transport to the API consumer — to do most of what I think you want to do, you’re going to have to find a solution to break through the abstraction boundary that libnanomsg provides. (And frankly that abstraction boundary is a big percentage of the services that libnanomsg is designed to provide — meaning by doing this you’re sort of negating a sizeable part of the benefit of libnanomsg — particularly the transport independence, and the ability to have many transports participating on a single socket.)
Have a look here: http://openlab.cern/publications/presentations/new-libfabric-based-transport-nanomsg

> I’ve said here and in many other places, I highly recommend implementing a simple copy scheme in your transport first, and benchmarking that. Frankly, compared to other operations that libnanomsg does, the copy of your messaging data is unlikely to have a huge performance impact. The exception here would be if your application needs to send huge messages (e.g. >64KB) frequently.
Unfortunately my starting message size is 25 MB and it can go up to 1 GB, so memory copying is a no-go in our case.

I did some benchmarks comparing nn_allocmsg_ptr (sending the same pointer every time) against nn_allocmsg (allocating a new chunk before each send), and I saw that performance drops considerably with messages bigger than 16 MB (at 128 MB I see a 30% drop). If you introduce a memcpy on top of the malloc I guess it will drop even further...
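
Roughly, the two benchmark loops looked like this (a simplified sketch: socket setup, the receiving peer and warm-up are left out, the message size was varied per run, and nn_allocmsg_ptr plus the no_free destructor come from my branch, not from upstream nanomsg):

#include <nanomsg/nn.h>
#include <stddef.h>
#include <stdio.h>
#include <time.h>

#define ITERS 100

extern void no_free( void * ptr, void * userptr );  // no-op destructor, as sketched above

static double elapsed( struct timespec a, struct timespec b )
{
    return (b.tv_sec - a.tv_sec) + (b.tv_nsec - a.tv_nsec) / 1e9;
}

static void bench( int s, void * buf, size_t sz )
{
    struct timespec t0, t1;
    int i;

    // Variant A: wrap the same externally owned buffer on every send
    clock_gettime( CLOCK_MONOTONIC, &t0 );
    for (i = 0; i < ITERS; i++) {
        void * msg = nn_allocmsg_ptr( buf, sz, &no_free, NULL );
        nn_send( s, &msg, NN_MSG, 0 );   // zero-copy hand-off to nanomsg
    }
    clock_gettime( CLOCK_MONOTONIC, &t1 );
    printf( "ptr variant  : %.3f s\n", elapsed( t0, t1 ) );

    // Variant B: allocate a fresh chunk before every send
    clock_gettime( CLOCK_MONOTONIC, &t0 );
    for (i = 0; i < ITERS; i++) {
        void * msg = nn_allocmsg( sz, 0 );
        nn_send( s, &msg, NN_MSG, 0 );   // ownership passes to nanomsg
    }
    clock_gettime( CLOCK_MONOTONIC, &t1 );
    printf( "alloc variant: %.3f s\n", elapsed( t0, t1 ) );
}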

> (libnanomsg has performance issues that come from the extra system calls and file descriptors it uses to provide a poll() and select() compatible notification mechanism — basically it performs a small write(2) to a notification pipe, which has to be poll()’d, then read(). That’s at least three system calls more per message than we would have if we just used a simple synchronous pattern using threads.
I know, but I couldn't re-use nanomsg's workers with libfabric, so I had to implement my own worker that doesn't use so many FDs. (The limitation comes from libfabric, since it doesn't give you any FD that you can poll on...)

My custom worker just notifies the FSM directly when an event is received, so I don't use this signaling: https://github.com/wavesoft/nanomsg-transport-ofi/blob/devel-ofiw/src/transports/ofi/ofiw.c#L168
> Sadly that problem cannot be fixed without a major rewrite of libnanomsg, and not without discarding the poll() and select() compatible semantic. I don’t see that happening for libnanomsg, ever; it’s something for a wire compatible alternate implementation.)
An interesting bit of trivia here: libfabric has a 'sockets' provider implementation, so I tested whether its performance is any better than the native nanomsg 'tcp' transport, and as a matter of fact it looks much better (on large message sizes). I won't give any numbers until I am 100% sure of my code, but it might be an interesting alternative for high-performance applications :)

> If I were implementing a transport like yours, I’d have preallocated and premapped buffers available in the transport, and on TX I’d just copy into one of those; on TX completion I’d recycle that buffer internally. On RX you receive into your registered memory, and then copy into ordinary memory. I’d probably keep a pool of allocated RX buffers (normal memory, not registered with your device) handy to ensure that I could receive without getting stuck behind malloc.
This is more or less what I am doing :) Have a look at the presentation.
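
In case it helps, the receive-buffer recycling in my transport boils down to something like this (a stripped-down sketch; the real code also deals with the libfabric memory registration, locking and the FSM plumbing, and the names here are illustrative):

#include <stddef.h>

// A free-list of pre-registered receive buffers: memory is registered
// with the fabric once, handed out for RX, and pushed back when the
// chunk's destructor fires, so the hot path never calls malloc or
// re-registers memory. Locking is omitted for brevity.
struct rx_buf {
    void          * data;   // pre-registered memory region
    size_t          size;
    struct rx_buf * next;   // free-list link
};

struct rx_pool {
    struct rx_buf * free_list;   // buffers ready to be posted for RX
};

static struct rx_buf * rx_pool_get( struct rx_pool * p )
{
    // Pop a buffer to post for the next receive;
    // NULL means the pool ran dry and the caller has to wait.
    struct rx_buf * b = p->free_list;
    if (b)
        p->free_list = b->next;
    return b;
}

static void rx_pool_put( struct rx_pool * p, struct rx_buf * b )
{
    // Called from the chunk destructor once the consumer has freed the
    // message: the buffer goes straight back into the pool for re-use.
    b->next = p->free_list;
    p->free_list = b;
}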

> Now that said, if you can show that libnanomsg consumers see a notable performance difference justifying the complexity (more likely if your needs are to exchange huge messages, and your transport can move them natively using zero copy DMA or somesuch), then this becomes much more worthwhile.
I will do a presentation at some point, so I will post the numbers when I have them. But even now, with a beta version of the transport, the numbers do seem quite good. Note that there is still a way to go (e.g. I haven't tested all the protocols yet).

Cheers,
Ioannis


On 14/04/16 17:09, Garrett D'Amore wrote:
I think the latter is more likely to be useful and efficient. If you have very specific memory allocation needs, then I think the API consumer should provide memory regions. In the Solaris/illumos kernel we have a technique like this for mblks called “esballoc” — http://www.unix.com/man-page/opensolaris/9f/esballoc/ — all the high performance networking NIC cards use this.

I’m a little concerned here about how the final API is going to look for API consumers. I think it’s really important to have a clean API for users of the libnanomsg API, and as a design constraint, use of optimization features needs to be optional. That is, if an API consumer doesn’t use your fancy allocation scheme, then it should still work, perhaps less efficiently.

Further, there are challenges I think because you have to arrange for the API consumer and the transport to collaborate on the memory allocation scheme. What I mean is, the transport needs to be able to identify that the memory it wants to use is already “registered” (I guess this means that it is mapped for DMA in the system, and registers have been programmed on the device to identify its location by some kind of numeric ID?), so that it can do the right thing. I actually have no idea how you’re going to wind up doing that cleanly — most of my ideas for this wind up looking pretty ugly. A big part of the problem is that we don’t really reveal the transport to the API consumer — to do most of what I think you want to do, you’re going to have to find a solution to break through the abstraction boundary that libnanomsg provides. (And frankly that abstraction boundary is a big percentage of the services that libnanomsg is designed to provide — meaning by doing this you’re sort of negating a sizeable part of the benefit of libnanomsg — particularly the transport independence, and the ability to have many transports participating on a single socket.)

I’ve said here and in many other places, I highly recommend implementing a simple copy scheme in your transport first, and benchmarking that. Frankly, compared to other operations that libnanomsg does, the copy of your messaging data is unlikely to have a huge performance impact. The exception here would be if your application needs to send huge messages (e.g. >64KB) frequently. (libnanomsg has performance issues that come from the extra system calls and file descriptors it uses to provide a poll() and select() compatible notification mechanism — basically it performs a small write(2) to a notification pipe, which has to be poll()’d, then read(). That’s at least three system calls more per message than we would have if we just used a simple synchronous pattern using threads. Sadly that problem cannot be fixed without a major rewrite of libnanomsg, and not without discarding the poll() and select() compatible semantic. I don’t see that happening for libnanomsg, ever; it’s something for a wire compatible alternate implementation.)

If I were implementing a transport like yours, I’d have preallocated and premapped buffers available in the transport, and on TX I’d just copy into one of those; on TX completion I’d recycle that buffer internally. On RX you receive into your registered memory, and then copy into ordinary memory. I’d probably keep a pool of allocated RX buffers (normal memory, not registered with your device) handy to ensure that I could receive without getting stuck behind malloc.

On a system with DTrace, you might even be able to deep probe to see whether the bulk of your time is spent doing the bcopy(), or if that falls into the noise compared to e.g. the system calls. (I’m not sure what other introspection tools might be available on Linux or Windows. I’m mostly an illumos guy, though I use a Mac on the desktop.)

Now that said, if you can show that libnanomsg consumers see a notable performance difference justifying the complexity (more likely if your needs are to exchange huge messages, and your transport can move them natively using zero copy DMA or somesuch), then this becomes much more worthwhile.

  - Garrett


On Thu, Apr 14, 2016 at 6:03 AM, Ioannis Charalampidis <ioannis.charalampidis@xxxxxxx> wrote:

    Hi all!

    I am going to add an additional feature to the chunks core and I
    wanted your feedback in order to choose what would benefit
    everyone. In my case I will need to allocate a series of
    consecutive chunks for optimization reasons (fewer memory
    registrations), and since this is not currently supported I have
    two solutions planned for implementation:

    (1) Shall I introduce a high-level function: nn_chunk_alloc_many(
    size_t size, int type, int count, void*** chunks ) that allocates
    a number of consecutive chunks using the allocation type specified?

    Example:

    void ** chunks;
    nn_chunk_alloc_many( 1024, NN_ALLOC_PAGEALIGN, 4, &chunks );

    // .. use them as chunks ..

    // Free them
    nn_chunk_free( chunks[0] );
    nn_chunk_free( chunks[1] );
    nn_chunk_free( chunks[2] );
    nn_chunk_free( chunks[3] );

    Pros:

      * Very simple and straightforward API from user's PoV

    Cons:

      * This requires additional reference tracking and custom
        de-allocator functions in order to wait for all chunks to be
        free'd before the actual memory region is released, but that's
        easily managed.
      * In order to implement the memory registration I will need to
        know the buffer base address and overall size (which, in the
        case of memory alignment, won't be equal to size * count),
        therefore introducing a kind of ugly optional 5th parameter
        ( struct nn_chunk_meta * meta ) that will be used to track
        such information.
      * If more fine-grained control is required, it's difficult to
        access the implementation internals without hacking it (btw,
        this is something that I had been fighting with until I
        decided to actually touch the chunk code myself).


    (2) Or shall I introduce a lower-level function: nn_chunk_init(
    void * ptr, size_t ptr_size, nn_chunk_free_fn destructor, void *
    userptr, void ** chunk ) that initializes a chunk structure in a
    given buffer? In this case the user should allocate the
    consecutive buffer and then call this function to initialize parts
    of it as chunks.

    Example:

    size_t chunk_size = 1024 + nn_chunk_hdrsize();
    void * memory = aligned_alloc( sysconf(_SC_PAGESIZE), chunk_size * 4 );

    // Create chunks
    void * chunks[4];
    void * ptr = memory;
    for (int i=0; i<4; i++) {
        nn_chunk_init( ptr, chunk_size, &free_fn, NULL, &chunks[i] );
        ptr = ((uint8_t*)ptr) + chunk_size;
    }

    // .. use them ..

    // Free chunks (this calls the given free function; no memory is actually released)
    nn_chunk_free( chunks[0] );
    nn_chunk_free( chunks[1] );
    nn_chunk_free( chunks[2] );
    nn_chunk_free( chunks[3] );

    // User needs to free the memory eventually, or needs to
    // implement the high-level logic mentioned before to free
    // the memory when the last chunk is freed
    free( memory );

    Pros:

      * No need to implement the reference tracking, which keeps the
        chunk core cleaner
      * The custom free function can be used to track
        implementation-specific logic (ex. mark buffer as free for re-use)
      * No need to track the individual chunks when it's time for
        clean-up, just free the allocated memory.

    Cons:

      * The user needs to do the memory management.
      * The API is very similar to nn_chunk_alloc_ptr, which might
        cause confusion. The difference is that nn_chunk_alloc_ptr
        just creates a const pointer-chunk referring to the user data,
        while nn_chunk_init assumes the given data is a chunk and
        initializes it as such (writes a chunk header and returns a
        pointer to the chunk data).

    (3) Or shall I implement both solutions?

    Looking forward to your comment/choice!

    Cheers,
    Ioannis


