[py-lmdb] Re: py-lmdb write performance

  • From: David Wilson <dw@xxxxxxxx>
  • To: py-lmdb@xxxxxxxxxxxxx
  • Date: Tue, 27 May 2014 14:17:58 +0000

Eek! What binding version / Python version / OS?

That is very broken.


David

On Tue, May 27, 2014 at 06:58:48AM -0700, Dinesh Vadhia wrote:
> Do you see what is wrong with this put code?
> 
> def put(env, db, append=False, key, value):
>    with env.begin(db, write=True) as txn:
>        txn.put(key, value, append=False)
>    return
> 
> put(env=env, db=db, append=True, key='a', value='b')
> 
> TypeError: put() got an unexpected keyword argument 'append'
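> 
> Side note: as posted, that def is not valid Python to begin with
> (parameters with defaults must come after the required ones), so the
> code that actually raised the TypeError was presumably slightly
> different. For reference, a version of the wrapper that parses and
> forwards the flag; an untested sketch:
> 
> def put(env, db, key, value, append=False):
>     # Forward the caller's append flag instead of hard-coding False.
>     with env.begin(db=db, write=True) as txn:
>         txn.put(key, value, append=append)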
> 
> 
> 
> --------------------------------------------------
> From: "David Wilson" <dw@xxxxxxxx>
> Sent: Tuesday, May 27, 2014 6:18 AM
> To: <py-lmdb@xxxxxxxxxxxxx>
> Subject: [py-lmdb] Re: py-lmdb write performance
> 
> >Hi Dinesh,
> >
> >Your "divide and conquer" approach sounds interesting. In fact, assuming
> >the merge step is literally just combining the partitions into one
> >master database without any extra processing, LMDB includes a special
> >'append' mode that would speed this operation up.
> >
> >A nice side effect of this approach is that the final database becomes
> >optimally packed in the merge step, since it is written sequentially.
> >
> >Perhaps something like:
> >
> >   def sorted_union(i1, i2):
> >       """Merge two already-sorted iterables of (key, value) pairs
> >       into a single sorted stream."""
> >       i1 = iter(i1)
> >       i2 = iter(i2)
> >       e1 = next(i1, None)
> >       e2 = next(i2, None)
> >       while e1 is not None and e2 is not None:
> >           if e1 <= e2:
> >               yield e1
> >               e1 = next(i1, None)
> >           else:
> >               yield e2
> >               e2 = next(i2, None)
> >
> >       # Drain whichever input is not yet exhausted.
> >       for e, it in (e1, i1), (e2, i2):
> >           if e is not None:
> >               yield e
> >           for elem in it:
> >               yield elem
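> >
> >   # Quick sanity check with made-up pairs; tuples compare
> >   # element-wise, so records merge by key first:
> >   #   list(sorted_union([('a', 1), ('c', 3)], [('b', 2)]))
> >   #   -> [('a', 1), ('b', 2), ('c', 3)]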
> >
> >   def iterate_remote_db(num):
> >       """Do whatever necessary to call Cursor.iternext() on the remote
> >       database, returning an iterable of (key, value) pairs"""
> >
> >   # Build a recursive union of all the cursor iterators
> >   merged = iter_local_db()
> >   for num in range(NUM_REMOTE_DBS):
> >       merged = sorted_union(merged, iterate_remote_db(num))
> >
> >   # Write sequentially to the final DB
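> >   # (append=True assumes the merged keys arrive in sorted order;
> >   # LMDB can then pack pages sequentially instead of searching for
> >   # each insert position.)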
> >   with master_env.begin(write=True) as txn:
> >       curs = txn.cursor()
> >       curs.putmulti(merged, append=True)
> >
> >
> >David
> >
> >On Tue, May 27, 2014 at 05:57:53AM -0700, Dinesh Vadhia wrote:
> >>The problem to solve is to create a very large db (> 1 TB) of
> >>synthetic data using a cluster of machines.  Once created, the db
> >>will be accessed by one machine only, for predominantly read-only
> >>use.  The filesystem is network attached.
> >>
> >>One method to create the db is for each machine to create a
> >>dictionary of data and save it on the filesystem - this is pretty
> >>fast.  Next, get one machine (only) to write each dictionary's data
> >>to the db.  One machine writing to lmdb on a filesystem across a
> >>network should be okay, but slow - yes?
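> >>
> >>A rough sketch of that single-writer load, assuming each machine
> >>dumped its dictionary with pickle (the paths and map_size here are
> >>made up):
> >>
> >>   import pickle
> >>   import lmdb
> >>
> >>   env = lmdb.open('/mnt/nfs/master.lmdb', map_size=2 * 1024 ** 4)
> >>   for path in ('part0.pkl', 'part1.pkl'):  # one dump per machine
> >>       with open(path, 'rb') as fp:
> >>           part = pickle.load(fp)
> >>       # One write transaction per partition keeps commits bounded.
> >>       with env.begin(write=True) as txn:
> >>           for key, value in sorted(part.items()):
> >>               txn.put(key, value)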
> >
> 
