[py-lmdb] Re: py-lmdb write performance

  • From: David Wilson <dw@xxxxxxxx>
  • To: py-lmdb@xxxxxxxxxxxxx
  • Date: Tue, 27 May 2014 13:18:33 +0000

Hi Dinesh,

Your "divide and conquer" approach sounds interesting. In fact, assuming
the merge step is literally just combining the partitions into one
master database without any extra processing, LMDB includes a special
'append' mode that would speed this operation up.

A nice side effect of this approach is that the final database becomes
optimally packed in the merge step, since it is written sequentially.

Perhaps something like:

    def sorted_union(i1, i2):
        i1 = iter(i1)
        i2 = iter(i2)
        e1 = next(i1, None)
        e2 = next(i2, None)
        while e1 and e2:
            if e1 <= e2:
                yield e1
                e1 = next(i1, None)
            else:
                yield e2
                e2 = next(i2, None)

        for elem, it in (e1, i1), (e2, i2):
            if e:
                yield e
            for elem in it:
                yield elem

    def iterate_remote_db(num):
        """Do whatever necessary to call Cursor.iternext() on the remote
        database, returning an iterable of (key, value) pairs"""

    # Build a recursive union of all the cursor iterators
    merged = iter_local_db()
    for num in range(NUM_REMOTE_DBS):
        merged = sorted_union(merged, iterate_remote_db(num))

    # Write sequentially to the final DB
    with master_env.begin(write=True) as txn:
        curs = txn.cursor()
        curs.putmulti(merged, append=True)


David

On Tue, May 27, 2014 at 05:57:53AM -0700, Dinesh Vadhia wrote:
> The problem to solve is to create a very large db (> 1tb) of synthetic data
> using a cluster of machines.  Once created, the db will be accessed by one
> machine only for predominantly read-only use.  The filesystem is network
> attached.
> 
> One method to create the db is for each machine to create a dictionary of
> data and save it on the filesystem - this is pretty fast.  Next, get one
> machine (only) to write each dictionary data to the db.  One machine writing
> to lmdb on filesystem across a network should be okay but slow - yes?



Other related posts: