Do you see what is wrong with this put code? def put(env, db, append=False, key, value): with env.begin(db, write=True) as txn: txn.put(key, value, append=False) return put(env=env, db=db, append=True, key='a', value='b') TypeError: put() got an unexpected keyword argument 'append' -------------------------------------------------- From: "David Wilson" <dw@xxxxxxxx> Sent: Tuesday, May 27, 2014 6:18 AM To: <py-lmdb@xxxxxxxxxxxxx> Subject: [py-lmdb] Re: py-lmdb write performance
Hi Dinesh, Your "divide and conquer" approach sounds interesting. In fact, assuming the merge step is literally just combining the partitions into one master database without any extra processing, LMDB includes a special 'append' mode that would speed this operation up. A nice side effect of this approach is that the final database becomes optimally packed in the merge step, since it is written sequentially. Perhaps something like: def sorted_union(i1, i2): i1 = iter(i1) i2 = iter(i2) e1 = next(i1, None) e2 = next(i2, None) while e1 and e2: if e1 <= e2: yield e1 e1 = next(i1, None) else: yield e2 e2 = next(i2, None) for elem, it in (e1, i1), (e2, i2): if e: yield e for elem in it: yield elem def iterate_remote_db(num): """Do whatever necessary to call Cursor.iternext() on the remote database, returning an iterable of (key, value) pairs""" # Build a recursive union of all the cursor iterators merged = iter_local_db() for num in range(NUM_REMOTE_DBS): merged = sorted_union(merged, iterate_remote_db(num)) # Write sequentially to the final DB with master_env.begin(write=True) as txn: curs = txn.cursor() curs.putmulti(merged, append=True) David On Tue, May 27, 2014 at 05:57:53AM -0700, Dinesh Vadhia wrote:The problem to solve is to create a very large db (> 1tb) of synthetic data using a cluster of machines. Once created, the db will be accessed by onemachine only for predominantly read-only use. The filesystem is network attached. One method to create the db is for each machine to create a dictionary of data and save it on the filesystem - this is pretty fast. Next, get onemachine (only) to write each dictionary data to the db. One machine writingto lmdb on filesystem across a network should be okay but slow - yes?