[py-lmdb] Re: py-lmdb write performance

  • From: "Dinesh Vadhia" <dineshvadhia@xxxxxxxxxxx>
  • To: <py-lmdb@xxxxxxxxxxxxx>
  • Date: Thu, 29 May 2014 07:46:29 -0700

I don't know if the performance is an lmdb or a cluster issue. Here are key answers:

- Using a high-performance CentOS 6 based cluster with InfiniBand and fast disks; we have sole access to these machines on the cluster.
- It takes ~20 secs to generate a ~1gb dictionary on each machine.
- Next, one machine writes each dictionary's data to lmdb on a filesystem across the network, which takes ~2.5 hours per dictionary.

Attached is the output of dirtybench.py from Windows and Linux.

Best ...

From: "David Wilson" <dw@xxxxxxxx>
Sent: Thursday, May 29, 2014 5:25 AM
To: <py-lmdb@xxxxxxxxxxxxx>
Subject: [py-lmdb] Re: Fw: Re: py-lmdb write performance

Hey Dinesh,

On Thu, May 29, 2014 at 02:15:06AM -0700, Dinesh Vadhia wrote:

- on Windows, the data.mdb is created with size=map_size
- on Linux, data.mdb is created with size ~12K irrespective of map_size.
Once db population starts, map_size disk space is allocated.

If so, then why is it taking 2.5 hours to write ~1gb of data?

Are you using writemap=True? On Linux and Windows, writemap=True relies
on sparse files. The "size" reported by Windows/Linux may be confusing,
depending on which tool you use, as some will report "allocated size"
and others "logical size".

E.g. the Linux "du" command will report allocated size, while the "ls"
command will report logical size.

IIRC Windows explorer will report logical size, whereas DOS "dir"
command will report allocated size.
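
The difference between "logical size" and "allocated size" described above can be checked directly. The following is a minimal sketch for Python 3, not part of py-lmdb: the helper names file_sizes and make_sparse_file are made up for illustration, and it assumes a filesystem that supports sparse files. st_blocks is counted in 512-byte units on Linux and is not available on Windows, where the sketch returns None for the allocated size.

```python
import os
import tempfile


def file_sizes(path):
    """Return (logical_bytes, allocated_bytes) for path.

    logical_bytes is st_size (what "ls -l" shows).
    allocated_bytes is st_blocks * 512 (what "du" is based on), or None
    on platforms such as Windows where st_blocks does not exist.
    """
    st = os.stat(path)
    blocks = getattr(st, "st_blocks", None)
    allocated = blocks * 512 if blocks is not None else None
    return st.st_size, allocated


def make_sparse_file(path, logical_size):
    """Create a file of logical_size bytes without writing any data."""
    with open(path, "wb") as f:
        f.truncate(logical_size)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        # The file name mirrors LMDB's data file, but this is just an
        # empty sparse file, not a real database.
        path = os.path.join(tmp, "data.mdb")
        make_sparse_file(path, 64 * 1024 * 1024)
        logical, allocated = file_sizes(path)
        print("logical size:  ", logical)
        print("allocated size:", allocated)
```

On a filesystem with sparse-file support the allocated size should stay far below the logical size until data is actually written, which would be consistent with a data.mdb that looks tiny under one tool and map_size-sized under another.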

Additionally even when writemap=False, LMDB holds some state in memory
while a transaction is alive, and so the disk usage may not be reflected
until .commit() is called.

Regarding your performance problem, you are not providing nearly enough
information for me to help you.
   * What OS?
   * What filesystem?
   * What host machine?
   * Is it a VM?
   * Have you run your script under 'cProfile --sort=cum' and verified
     LMDB is really the cause?
   * Are you still using a network filesystem? We already know that is
   * Does your job start fast, and then slow down? If so, is your
     dataset larger than RAM?
   * What kind of disks are you writing to?
   * Are there any other users of the machine that might cause it to be
   * How large are your transactions? (how many records / how many GB).
   * Have you tried splitting your writes into smaller txns?
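
Two items from the checklist above, profiling and splitting writes into smaller transactions, can be sketched together. This is a hedged Python 3 illustration, not code from the thread: batched, write_in_batches and profile_call are hypothetical helper names, and the batch size of 100000 is an arbitrary starting point. The env argument is assumed to behave like lmdb.Environment, where env.begin(write=True) returns a transaction that commits when its with-block exits without an exception.

```python
import cProfile
import io
import pstats
from itertools import islice


def batched(items, batch_size):
    """Yield lists of at most batch_size items from any iterable."""
    it = iter(items)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch


def write_in_batches(env, items, batch_size=100000, append=False):
    """Write (key, value) pairs using one write transaction per batch.

    env is assumed to be an lmdb.Environment (or anything with the same
    begin(write=True) / txn.put(key, value, append=...) interface).
    Returns the number of transactions that were committed.
    """
    txns = 0
    for batch in batched(items, batch_size):
        with env.begin(write=True) as txn:
            for key, value in batch:
                txn.put(key, value, append=append)
        txns += 1
    return txns


def profile_call(func, *args, **kwargs):
    """Run func under cProfile; return (result, text report).

    The report is sorted by cumulative time, limited to 15 entries.
    """
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args, **kwargs)
    out = io.StringIO()
    stats = pstats.Stats(profiler, stream=out)
    stats.sort_stats("cumulative").print_stats(15)
    return result, out.getvalue()
```

With py-lmdb installed, something like profile_call(write_in_batches, env, pairs, append=True) would show whether the time is really spent inside LMDB calls or elsewhere, for example in generating or reading the pairs. Note that append=True only makes sense when the keys arrive in sorted order. A whole script can also be profiled from the command line with python -m cProfile --sort=cumulative followed by the script name.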

I have updated examples/dirtybench.py in the Git repository to work on
Windows. Please provide dirtybench.py output from your host environment
where the slowness is being observed.


From: "Dinesh Vadhia" <dineshvadhia@xxxxxxxxxxx>
Sent: Wednesday, May 28, 2014 10:15 AM
To: <py-lmdb@xxxxxxxxxxxxx>
Subject: Re: [py-lmdb] Re: py-lmdb write performance

>Looks like the (centos-based) cluster is not creating the lmdb db with the
>correct map_size.  A small map_size works fine but larger ones (eg. > 30gb)
>creates a 12K db !  Not sure what is going on but the admins are
>looking into it.  Not sure if it is an lmdb or OS problem yet.
>From: "David Wilson" <dw@xxxxxxxx>
>Sent: Wednesday, May 28, 2014 5:57 AM
>To: <py-lmdb@xxxxxxxxxxxxx>
>Subject: [py-lmdb] Re: py-lmdb write performance
>>That's about 120kb/sec, which doesn't sound right.
>>On Wed, May 28, 2014 at 05:48:47AM -0700, Dinesh Vadhia wrote:
>>>Generated sorted keys to create local dictionaries on each cluster
>>>next, one machine merges each dictionary in sorted order into db; but it
>>>still takes ~2.5 hours to write ~1gb of data with append=True;
>>>doesn't sound right does it or does it?
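
The "about 120kb/sec" figure quoted above follows from the numbers in the thread. A small worked calculation, assuming "~1gb" means 1 GiB and "~2.5 hours" means exactly 9000 seconds (both are approximations of the reported values), comes out at roughly 119 kB/s. The second calculation uses the Linux dirtybench "insert" line (50911.31kb of keys and values in 3.746s), taking dirtybench's "kb" as 1024 bytes; that benchmark ran against a local /tmp path, so it is only a rough point of comparison.

```python
def throughput_kb_per_sec(total_bytes, seconds):
    """Average write rate in kB/s, with 1 kB taken as 1000 bytes."""
    return total_bytes / seconds / 1000.0


ONE_GIB = 1024 ** 3        # "~1gb" taken as 1 GiB (an assumption)
ELAPSED = 2.5 * 3600       # "~2.5 hours" expressed in seconds

# Rate observed when writing one dictionary on the cluster.
observed = throughput_kb_per_sec(ONE_GIB, ELAPSED)

# Rate implied by the dirtybench "insert" line on Linux.
bench = throughput_kb_per_sec(50911.31 * 1024, 3.746)

print("observed:   %.0f kB/s" % observed)
print("dirtybench: %.0f kB/s" % bench)
print("ratio:      %.0fx" % (bench / observed))
```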

$ python dirtybench.py
permutate 1876098 words avglen 13 took 2.99sec
DB_PATH: /tmp/dirtybenchUgfwU5
                                 insert:  3.746s     500855/sec

stat: {'branch_pages': 244L, 'leaf_pages': 25385L, 'overflow_pages': 0L,
'psize': 4096L, 'depth': 4L, 'entries': 1876098L}
k+v size 50911.31kb avg 27, on-disk size: 101540.00kb avg 55

                enum (key, value) pairs:  0.340s    5524417/sec
        reverse enum (key, value) pairs:  0.320s    5868640/sec
              enum (key, value) buffers:  0.291s    6437401/sec

                            rand lookup:  3.459s     542330/sec
                    per txn rand lookup:  4.248s     441653/sec
                       rand lookup+hash:  3.526s     532139/sec
                    rand lookup buffers:  3.462s     541905/sec
               rand lookup+hash buffers:  3.469s     540749/sec
           rand lookup buffers (cursor):  3.569s     525735/sec

                                get+put:  5.636s     332890/sec
                                replace:  4.472s     419499/sec

                          insert (rand):  3.852s     487009/sec
                           insert (seq):  2.085s     899720/sec
            insert (rand), reuse cursor:  4.053s     462928/sec
             insert (seq), reuse cursor:  0.995s    1886274/sec
                       insert, putmulti:  3.541s     529896/sec
             insert, putmulti+generator:  3.836s     489081/sec

                                 append:  1.229s    1526228/sec
                   append, reuse cursor:  1.102s    1703103/sec
                        append+putmulti:  0.488s    3842824/sec

stat: {'branch_pages': 117L, 'leaf_pages': 17460L, 'overflow_pages': 0L,
'psize': 4096L, 'depth': 3L, 'entries': 1876098L}
k+v size 50911.31kb avg 27, on-disk size: 69840.00kb avg 38

python dirtybench.py
permutate 1876098 words avglen 13 took 4.49sec
DB_PATH: c:\docume~1\dinesh\locals~1\temp\dirtybenchmpw0nh
                                 insert:  6.141s     305503/sec

stat: {'branch_pages': 246L, 'leaf_pages': 25349L, 'overflow_pages': 0L,
'psize': 4096L, 'depth': 4L, 'entries': 1876098L}
k+v size 50911.31kb avg 27, on-disk size: 101396.00kb avg 55

                enum (key, value) pairs:  0.547s    3429795/sec
        reverse enum (key, value) pairs:  0.469s    4000208/sec
              enum (key, value) buffers:  0.563s    3332323/sec

                            rand lookup:  3.078s     609518/sec
                    per txn rand lookup:  4.406s     425805/sec
                       rand lookup+hash:  3.266s     574432/sec
                    rand lookup buffers:  3.015s     622254/sec
               rand lookup+hash buffers:  3.235s     579937/sec
           rand lookup buffers (cursor):  3.234s     580116/sec

                                get+put:  8.812s     212902/sec
                                replace:  7.375s     254386/sec

                          insert (rand):  7.062s     265661/sec
                           insert (seq):  5.079s     369383/sec
            insert (rand), reuse cursor:  8.625s     217518/sec
             insert (seq), reuse cursor:  5.172s     362741/sec
                       insert, putmulti:  5.968s     314359/sec
             insert, putmulti+generator:  7.109s     263904/sec

                                 append:  4.047s     463577/sec
                   append, reuse cursor:  4.469s     419802/sec
                        append+putmulti:  4.172s     449687/sec

stat: {'branch_pages': 117L, 'leaf_pages': 17442L, 'overflow_pages': 0L,
'psize': 4096L, 'depth': 3L, 'entries': 1876098L}
k+v size 50911.31kb avg 27, on-disk size: 69768.00kb avg 38
