Hi Dinesh, Your "divide and conquer" approach sounds interesting. In fact, assuming the merge step is literally just combining the partitions into one master database without any extra processing, LMDB includes a special 'append' mode that would speed this operation up. A nice side effect of this approach is that the final database becomes optimally packed in the merge step, since it is written sequentially. Perhaps something like: def sorted_union(i1, i2): i1 = iter(i1) i2 = iter(i2) e1 = next(i1, None) e2 = next(i2, None) while e1 and e2: if e1 <= e2: yield e1 e1 = next(i1, None) else: yield e2 e2 = next(i2, None) for elem, it in (e1, i1), (e2, i2): if e: yield e for elem in it: yield elem def iterate_remote_db(num): """Do whatever necessary to call Cursor.iternext() on the remote database, returning an iterable of (key, value) pairs""" # Build a recursive union of all the cursor iterators merged = iter_local_db() for num in range(NUM_REMOTE_DBS): merged = sorted_union(merged, iterate_remote_db(num)) # Write sequentially to the final DB with master_env.begin(write=True) as txn: curs = txn.cursor() curs.putmulti(merged, append=True) David On Tue, May 27, 2014 at 05:57:53AM -0700, Dinesh Vadhia wrote: > The problem to solve is to create a very large db (> 1tb) of synthetic data > using a cluster of machines. Once created, the db will be accessed by one > machine only for predominantly read-only use. The filesystem is network > attached. > > One method to create the db is for each machine to create a dictionary of > data and save it on the filesystem - this is pretty fast. Next, get one > machine (only) to write each dictionary data to the db. One machine writing > to lmdb on filesystem across a network should be okay but slow - yes?