[py-lmdb] Re: py-lmdb write performance

  • From: "Dinesh Vadhia" <dineshvadhia@xxxxxxxxxxx>
  • To: <py-lmdb@xxxxxxxxxxxxx>
  • Date: Tue, 27 May 2014 08:33:24 -0700

Oh! Maybe there is a problem:

Reading a dictionary (d) and writing each key:value pair to the db with:

for key, value in d.items():
   with env.begin(db, write=True) as txn:
       txn.put(key=key, value=value, append=True)

and reading it back with:

   with env.begin(db) as txn:
       value = txn.get(key)

But the value comes back as None for every key. The dictionary has been checked and it contains valid keys and values.

Using Python 2.7.5 on a Windows box.

What am I missing?
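
For comparison, here is a minimal sketch of the same write batched into one transaction with the keys sorted first - this assumes that append=True, like LMDB's append mode, expects keys to arrive in sorted (byte-wise) order, and that txn.put() returns False rather than raising when a write is refused; the path, db name, and sample dict are placeholders:

   import lmdb

   env = lmdb.open('/path/to/example.lmdb', max_dbs=1)   # placeholder path
   db = env.open_db('example')                           # placeholder named db
   d = {'b': '2', 'a': '1', 'c': '3'}                    # toy stand-in for the real dict

   # One write transaction for the whole batch avoids a commit per item;
   # sorting the keys gives append mode the sequential order it expects,
   # and checking put()'s return value makes a refused write visible.
   with env.begin(db=db, write=True) as txn:
       for key in sorted(d):
           if not txn.put(key, d[key], append=True):
               print('put refused key %r' % key)

   # Read back one key to verify.
   with env.begin(db=db) as txn:
       print(txn.get('a'))

With the keys pre-sorted, append mode also leaves the pages tightly packed, which matches the sequential-write behaviour described further down the thread.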


--------------------------------------------------
From: "David Wilson" <dw@xxxxxxxx>
Sent: Tuesday, May 27, 2014 7:17 AM
To: <py-lmdb@xxxxxxxxxxxxx>
Subject: [py-lmdb] Re: py-lmdb write performance

Eek! What binding version / Python version / OS?

That is very broken.


David

On Tue, May 27, 2014 at 06:58:48AM -0700, Dinesh Vadhia wrote:
Do you see what is wrong with this put code?

def put(env, db, append=False, key, value):
   with env.begin(db, write=True) as txn:
       txn.put(key, value, append=False)
   return

put(env=env, db=db, append=True, key='a', value='b')

TypeError: put() got an unexpected keyword argument 'append'
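
As written, that def would itself be rejected by Python ("non-default argument follows default argument"). For reference, a sketch of the wrapper with the defaulted parameter moved to the end and the flag forwarded to txn.put() instead of being hard-coded to False:

   # Sketch only: defaulted parameters must come after the non-default ones,
   # and the append flag is passed through rather than fixed at False.
   def put(env, db, key, value, append=False):
       with env.begin(db=db, write=True) as txn:
           return txn.put(key, value, append=append)

   put(env=env, db=db, append=True, key='a', value='b')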



--------------------------------------------------
From: "David Wilson" <dw@xxxxxxxx>
Sent: Tuesday, May 27, 2014 6:18 AM
To: <py-lmdb@xxxxxxxxxxxxx>
Subject: [py-lmdb] Re: py-lmdb write performance

>Hi Dinesh,
>
>Your "divide and conquer" approach sounds interesting. In fact, assuming
>the merge step is literally just combining the partitions into one
>master database without any extra processing, LMDB includes a special
>'append' mode that would speed this operation up.
>
>A nice side effect of this approach is that the final database becomes
>optimally packed in the merge step, since it is written sequentially.
>
>Perhaps something like:
>
>   def sorted_union(i1, i2):
>       i1 = iter(i1)
>       i2 = iter(i2)
>       e1 = next(i1, None)
>       e2 = next(i2, None)
>       while e1 and e2:
>           if e1 <= e2:
>               yield e1
>               e1 = next(i1, None)
>           else:
>               yield e2
>               e2 = next(i2, None)
>
>       # One side is exhausted; yield the other side's pending element,
>       # then drain whatever remains of its iterator.
>       for e, it in (e1, i1), (e2, i2):
>           if e:
>               yield e
>           for elem in it:
>               yield elem
>
>   def iterate_remote_db(num):
>       """Do whatever necessary to call Cursor.iternext() on the remote
>       database, returning an iterable of (key, value) pairs"""
>
>   # Build a recursive union of all the cursor iterators
>   merged = iter_local_db()
>   for num in range(NUM_REMOTE_DBS):
>       merged = sorted_union(merged, iterate_remote_db(num))
>
>   # Write sequentially to the final DB
>   with master_env.begin(write=True) as txn:
>       curs = txn.cursor()
>       curs.putmulti(merged, append=True)
>
>
>David
>
>On Tue, May 27, 2014 at 05:57:53AM -0700, Dinesh Vadhia wrote:
>>The problem to solve is to create a very large db (> 1 TB) of synthetic
>>data using a cluster of machines.  Once created, the db will be accessed
>>by one machine only for predominantly read-only use.  The filesystem is
>>network attached.
>>
>>One method to create the db is for each machine to create a dictionary of
>>data and save it on the filesystem - this is pretty fast.  Next, get one
>>machine (only) to write each dictionary's data to the db.  One machine
>>writing to lmdb on the filesystem across a network should be okay but
>>slow - yes?



