[py-lmdb] Fw: Re: py-lmdb write performance

  • From: "Dinesh Vadhia" <dineshvadhia@xxxxxxxxxxx>
  • To: <py-lmdb@xxxxxxxxxxxxx>
  • Date: Tue, 27 May 2014 07:24:07 -0700

Ok, got the put code working - my arguments were a bit squiffy!

Merging one dictionary at a time with a small test data, "append" reduced time to merge a couple of dictionaries into one db from 140s to 12s !

Best ...
Dinesh

--------------------------------------------------
From: "Dinesh Vadhia" <dineshvadhia@xxxxxxxxxxx>
Sent: Tuesday, May 27, 2014 6:58 AM
To: <py-lmdb@xxxxxxxxxxxxx>
Subject: Re: [py-lmdb] Re: py-lmdb write performance

Do you see what is wrong with this put code?

def put(env, db, append=False, key, value):
   with env.begin(db, write=True) as txn:
       txn.put(key, value, append=False)
   return

put(env=env, db=db, append=True, key='a', value='b')

TypeError: put() got an unexpected keyword argument 'append'



--------------------------------------------------
From: "David Wilson" <dw@xxxxxxxx>
Sent: Tuesday, May 27, 2014 6:18 AM
To: <py-lmdb@xxxxxxxxxxxxx>
Subject: [py-lmdb] Re: py-lmdb write performance

Hi Dinesh,

Your "divide and conquer" approach sounds interesting. In fact, assuming
the merge step is literally just combining the partitions into one
master database without any extra processing, LMDB includes a special
'append' mode that would speed this operation up.

A nice side effect of this approach is that the final database becomes
optimally packed in the merge step, since it is written sequentially.

Perhaps something like:

   def sorted_union(i1, i2):
       i1 = iter(i1)
       i2 = iter(i2)
       e1 = next(i1, None)
       e2 = next(i2, None)
       while e1 and e2:
           if e1 <= e2:
               yield e1
               e1 = next(i1, None)
           else:
               yield e2
               e2 = next(i2, None)

       for elem, it in (e1, i1), (e2, i2):
           if e:
               yield e
           for elem in it:
               yield elem

   def iterate_remote_db(num):
       """Do whatever necessary to call Cursor.iternext() on the remote
       database, returning an iterable of (key, value) pairs"""

   # Build a recursive union of all the cursor iterators
   merged = iter_local_db()
   for num in range(NUM_REMOTE_DBS):
       merged = sorted_union(merged, iterate_remote_db(num))

   # Write sequentially to the final DB
   with master_env.begin(write=True) as txn:
       curs = txn.cursor()
       curs.putmulti(merged, append=True)


David

On Tue, May 27, 2014 at 05:57:53AM -0700, Dinesh Vadhia wrote:
The problem to solve is to create a very large db (> 1tb) of synthetic data using a cluster of machines. Once created, the db will be accessed by one
machine only for predominantly read-only use.  The filesystem is network
attached.

One method to create the db is for each machine to create a dictionary of
data and save it on the filesystem - this is pretty fast.  Next, get one
machine (only) to write each dictionary data to the db. One machine writing
to lmdb on filesystem across a network should be okay but slow - yes?





Other related posts: